
Applied Psychological Measurement 37(3), 201–225. © The Author(s) 2013. DOI: 10.1177/0146621612470210

The Reliability and Precision of Total Scores and IRT

Estimates as a Function of Polytomous IRT Parameters and Latent Trait Distribution

Steven Andrew Culpepper1


A classic topic in the fields of psychometrics and measurement has been the impact of the number of scale categories on test score reliability. This study builds on previous research by further articulating the relationship between item response theory (IRT) and classical test theory (CTT). Equations are presented for comparing the reliability and precision of scores within the CTT and IRT frameworks. This study presents new results pertaining to the relative precision (i.e., the test score conditional standard error of measurement for a given trait value) of CTT and IRT, and the new results shed light on the conditions where total scores and IRT estimates are more or less precisely measured. The relative reliability of CTT and IRT scores is examined as a function of item characteristics (e.g., locations, category thresholds, and discriminations) and subject characteristics (e.g., the skewness and kurtosis of the latent distribution). CTT total scores were more reliable when the latent distribution was mismatched with category thresholds, but the discrepancy between CTT and IRT declined as the number of scale categories increased. This article also considers the appropriateness of linear approximations of polytomous items and presents circumstances where linear approximations are viable. A linear approximation may be appropriate for items with two response options, depending on the item discrimination and the match between the item location and the latent distribution. However, linear approximations are biased whenever items are located in the tails of the latent distribution, and the bias is larger for more discriminating items.


Keywords: reliability, scale construction, classical test theory, item response theory, polytomous items, information function

The impact of the number of scale categories on reliability is a classic topic in psychology (Symonds, 1924), and an extensive body of research has examined the effect of the number of scale categories on the corresponding reliability of total scores (x). In fact, researchers have examined the impact of the number of scale categories on total score reliability using empirical data (Adelson & McCoach, 2010; Bendig, 1954; Chafouleas, Christ, & Riley-Tillman, 2009; L. Chang, 1994; Komorita & Graham, 1965; Matell & Jacoby, 1971; Weng, 2004), Monte Carlo simulations (Aguinis, Pierce, & Culpepper, 2009; Bandalos & Enders, 1996; Cicchetti, Showalter, & Tyrer, 1985; Enders & Bandalos, 1999; Greer, Dunlap, Hunter, & Berman, 2006; Jenkins & Taber, 1977; Lissitz & Green, 1975), and analytic derivations (Krieg, 1999).

1University of Illinois at Urbana–Champaign, USA

Corresponding Author:
Steven Andrew Culpepper, University of Illinois at Urbana–Champaign, 116D Illini Hall, MC-374, 725 South Wright Street, Champaign, IL 61820, USA.

Methodological developments have bridged the concept of reliability between the item response theory (IRT) and classical test theory (CTT) frameworks and discussed concepts that have traditionally been reserved for IRT (e.g., item information functions [IIFs]) within the context of CTT. The purpose of this article is to understand the circumstances where researchers should prefer estimating true scores (i.e., u) with test scores derived from IRT (i.e., ^u) versus CTT (i.e., x). This article compares the precision and reliability of ^u and x, and equations are presented for investigating how item and subject characteristics affect the reliability of ^u and x.

Mellenbergh (1996) noted that reliability is a population-dependent quantity that is affected by characteristics of latent distributions, whereas the conditional standard error of measurement (CSEM) quantifies error variance (i.e., precision or the inverse of information) for a specific u value. New equations are presented that compare the relative precision of ^u and x. The results show that IRT and CTT have CSEMs that are roughly the mirror image across values of u. That is, ^u is measured more precisely in portions of the u continuum that include relatively more category thresholds, whereas x tends to be measured more precisely for u values that are further from category thresholds.

It is important to articulate the contributions of this article to existing research. First, methodological advances concerning dichotomous items established a link between the IRT and CTT frameworks by articulating the reliability of total scores and percentile ranks with corresponding IRT item parameters, such as item difficulty, discrimination, and guessing parameters (Bechger, Maris, Verstralen, & Beguin, 2003; Dimitrov, 2003; Kolen, Zeng, & Hanson, 1996; May & Nicewander, 1994). Additional research has studied the impact of IRT item parameters on the reliability of gain scores (May & Jackson, 2005), the lower and upper bounds of the IRT reliability coefficient (Kim & Feldt, 2010), the reliability of scale scores and performance assessments using polytomous IRT (Wang, Kolen, & Harris, 2000), and the reliability of subscores using unidimensional (Haberman, 2008) and multidimensional IRT (Haberman & Sinharay, 2010). Researchers have also found that IRT scores provide more accurate estimates of interaction effects within the contexts of analysis of variance (Embretson, 1996) and multiple regression (Kang & Waller, 2005; Morse, Johanson, & Griffeth, 2012). This article offers new information about the connection between the concepts of reliability and precision in CTT and IRT. That is, no study has provided theoretical results about the relative precision of ^u and x for a given u value. The new derivations provide a theoretical rationale for circumstances when CTT scores include relatively more or less measurement error than IRT scores. Moreover, no research study has analytically studied the interactive effect of item characteristics and latent distribution shape on the relative reliability of ^u and x. This article uses the probability density function implied by Fleishman's (1978) power transformation (PT) method to study the impact of nonnormal latent distributions on the reliability of total scores.

Second, additional research has extended the concept of item and test information to CTT under the assumptions that observed measurements are continuous, rather than polytomous, and u is linearly related to x (Ferrando, 2002, 2009; McDonald, 1982; Mellenbergh, 1996). Ferrando notes that a linear model provides a good approximation when item discrimination indices are relatively small in value and coarsely measured items have five or more response categories. Moreover, Ferrando and Mellenbergh showed that, for the linear model, the CTT IIF is horizontal and unrelated to u. However, whenever the relationship between latent and observed total scores is nonlinear (which occurs when items are polytomous), the CTT IIFs are no longer unrelated to u. In fact, this article shows that the conditional standard error of x given u is a downward-facing function where scores near category thresholds have the least amount of precision. The accuracy of a linear relationship is also explored, and the results in this article examine the effect of test characteristics (e.g., item locations and discrimination) and subject characteristics (e.g., latent distribution shape) on the appropriateness of linear approximations of polytomous items.

Third, previous Monte Carlo studies that have studied the relative performance of CTT and IRT estimates are limited by the combination of parameter values and type of reliability studied. For instance, Greer et al. (2006) studied the impact of skewness on coefficient alpha with the constraint that item variances were equal, which may not occur frequently in practice. The results in this article can be used to study any combination of IRT parameter values and latent distribution shape and provide more general results than previous Monte Carlo simulations. In addition, Wang et al. (2000) presented equations for computing the reliability of scale scores from performance assessments using the generalized partial credit model. However, unlike Wang et al., this study presents equations for evaluating how the number of items, the number of scale categories, and the shape of the latent distribution affect the reliability of ^u and x.

Furthermore, R code (R Development Core Team, 2010) is available at http://publish.illinois.edu/sculpepper/, and researchers can use the R code to compute the reliability of ^u and x for different item characteristics and latent distributions. Consequently, this study presents new results and provides applied researchers with guidance for reliably scoring tests in different situations.

This article includes five sections. The first section presents equations for the reliability and precision of scores within the CTT and IRT paradigms and includes new results about the CSEM for CTT. The second section compares IRT and CTT in terms of CSEMs to provide a general understanding of the circumstances under which researchers should prefer CTT versus IRT scores.

The third section compares the relative reliability of ^u and x for different item characteristics (e.g., item locations and number of response categories) and subject distribution characteristics (e.g., the skewness and kurtosis of u), and the fourth section examines how item and subject characteristics affect the appropriateness of a linear approximation of polytomous items. The last section discusses the results and provides recommendations and concluding remarks.

Equations for the Reliability of CTT and IRT Estimates of u

Let u represent a latent variable and u_i an observed polytomous response, where i indexes items (i = 1, . . . , I). The observed polytomous response for item i can be expressed as a function of a true score (E{u_i | u}) and random error (e_i), such that u_i = E{u_i | u} + e_i. That is, the observed u_i equals an item true score E{u_i | u} (May & Nicewander, 1994), which is a nonlinear function of u, plus an error, e_i. Let j index category thresholds (j = 1, . . . , J), so that J + 1 is the corresponding number of categories for item i. That is, j is used to index categories as well (e.g., J + 1 = 4 implies that u_i has four categories). This article assumes that researchers code u_i using integers from 1 to J + 1. Several polytomous models exist to describe the relationship between u and the chance that u_i equals one of the J + 1 categories, including the graded response (Molenaar, Dolan, & De Boeck, 2012; Muraki, 1990; Samejima, 1969), partial credit (Masters, 1982; Muraki, 1992, 1993), and rating scale (Andrich, 1978a, 1978b) models. The derivations in this article use Muraki's (1990) modified graded response model,



$$P_{ij}(u_i \ge j \mid u) = \begin{cases} 1, & j = 0, \\[4pt] \dfrac{1}{1 + \exp\left[-a_i(u - b_i - c_j)\right]}, & 0 < j \le J, \\[4pt] 0, & j = J + 1, \end{cases} \tag{1}$$

where b_i and a_i are the item difficulty and discrimination parameters, respectively, and c_j is the jth threshold, which is equal for all I items. Equation 1 represents the chance that u_i ≥ j given u, and the probability that item i equals the jth category is

$$P_{ij}(u_i = j \mid u) = P_{ij}(u_i \ge j \mid u) - P_{i,j+1}(u_i \ge j + 1 \mid u). \tag{2}$$

Note that Muraki's model was chosen because the form of Equation 1 is easier to manipulate analytically and the derivations of expressions for derivatives are less cumbersome.

Muraki’s model assumes that the J cjare constant across items and are equally spaced. The goal of this article is not to estimate abilities or item parameters with Muraki’s model, and these assumptions can be relaxed to evaluate the impact of unequal item thresholds on the reliability and precision of x and ^u. In fact, the new expressions are applicable for any polytomous model, and Muraki’s model is only used for computational examples. Furthermore, the aforementioned models tend to yield scores that are highly correlated (Embretson & Reise, 2000), so we should not anticipate the results would change significantly if the partial credit or rating scale models were used.

The derivations below require the specification of a distribution for u. In this article, u is assumed to follow a Fleishman PT distribution, φ(u | Ω), where Ω = (μ, σ², κ₃, κ₄) indicates that u has mean μ, variance σ², and skewness and kurtosis of κ₃ and κ₄, respectively (φ(u | Ω) is discussed in greater detail in the Appendix). Note that any univariate distribution could be chosen for u. One advantage of using Fleishman's PT distribution is that it is flexible enough to explore how changes in μ, σ², κ₃, and κ₄ affect the relative reliability of IRT and CTT u estimates. However, as noted by an anonymous reviewer, researchers often set σ² = 1 to estimate item discriminations (i.e., the scale indeterminacy problem). Accordingly, the results in this article also use σ² = 1 to understand how manipulating item discriminations affects reliability. Fleishman's PT distribution does not encompass the universe of all univariate distributions, so future researchers can modify the associated R code to examine the reliability of CTT and IRT estimates when u follows other distributions. Furthermore, the discussion below provides an argument for how reliability within the CTT and IRT frameworks depends on the match between the shape of φ(u | Ω) and the topography of the conditional variances of x and ^u.

Reliability and Precision Within IRT Framework

One of the strengths of IRT over CTT relates to the well-known measure of precision for estimated trait scores, ^u. Reliability is specific to a group and is a function of the unconditional standard error of measurement (SEM). CTT has traditionally calculated the SEM for a group of scores, whereas the CSEM is a measure of precision that corresponds to a specific trait level within the IRT framework. The CSEM of ^u is related to the test information function (TIF), which is derived using the concept of Fisher's information to measure the amount of information that a single observation provides about u. In fact, the inverse of a TIF indicates the variance of ^u for a given u. Previous research (Muraki, 1993; Samejima, 1994) discussed TIFs for polytomous IRT models and noted that the IIF for item i is


$$I_i(u) = -\sum_{j=1}^{J+1} P_{ij}(u_i = j \mid u)\, \frac{\partial^2}{\partial u^2} \ln P_{ij}(u_i = j \mid u) = \sum_{j=1}^{J+1} \left[ \frac{\left[\partial P_{ij}(u_i = j \mid u)/\partial u\right]^2}{P_{ij}(u_i = j \mid u)} - \frac{\partial^2 P_{ij}(u_i = j \mid u)}{\partial u^2} \right]. \tag{3}$$


For Muraki’s (1990) modified graded response model, Pij(ui= jju) is a function of Pij(ui. jju), which is a logistic function. The first two derivatives of Pij(ui. jju) are as follows:

∂Pijðui. jjuÞ

∂u = aiPijðui. jjuÞ 1  Ph ijðui. jjuÞi ,

2Pijðui. jjuÞ

∂u2 = ai

∂Pijðui. jjuÞ

∂u h1 2Pijðui. jjuÞi

= a2iPijðui. jjuÞ 1  Ph ijðui. jjuÞi

1 2Pijðui. jjuÞ

h i



Accordingly, I_i(u) can be computed using the first and second derivatives of the item category probabilities, P_{ij}(u_i = j | u), which are as follows:

$$\frac{\partial P_{ij}(u_i = j \mid u)}{\partial u} = \frac{\partial P_{ij}(u_i \ge j \mid u)}{\partial u} - \frac{\partial P_{i,j+1}(u_i \ge j + 1 \mid u)}{\partial u},$$

$$\frac{\partial^2 P_{ij}(u_i = j \mid u)}{\partial u^2} = \frac{\partial^2 P_{ij}(u_i \ge j \mid u)}{\partial u^2} - \frac{\partial^2 P_{i,j+1}(u_i \ge j + 1 \mid u)}{\partial u^2}. \tag{5}$$


The TIF for a test of polytomous items is the sum of the respective IIFs:

$$\mathrm{TIF}(u) = \sum_{i=1}^{I} I_i(u). \tag{6}$$
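Equations 3 through 6 combine into a short routine. The sketch below is an illustrative Python version (not the author's R implementation) that computes the IIF from the analytic derivatives in Equations 4 and 5 and sums the IIFs into the TIF:

```python
import numpy as np

def item_info(u, a, b, c):
    """Item information I_i(u) (Equation 3) for Muraki's model."""
    c = np.asarray(c, dtype=float)
    # cumulative probabilities P(u_i >= j | u) padded with j = 0 and j = J + 1 boundaries
    Ps = np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-a * (u - b - c))), [0.0]))
    d1 = a * Ps * (1.0 - Ps)            # first derivative (Equation 4)
    d2 = a * d1 * (1.0 - 2.0 * Ps)      # second derivative (Equation 4)
    P = Ps[:-1] - Ps[1:]                # category probabilities (Equation 2)
    dP = d1[:-1] - d1[1:]               # category derivatives (Equation 5)
    d2P = d2[:-1] - d2[1:]
    return float(np.sum(dP**2 / P - d2P))

def tif(u, a_vec, b_vec, c):
    """Test information (Equation 6): the sum of the item informations."""
    return sum(item_info(u, a, b, c) for a, b in zip(a_vec, b_vec))
```

Because the second derivatives of the category probabilities sum to zero across categories, the bracketed difference in Equation 3 reduces to the familiar squared-slope-over-probability form when summed.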

The relationship between true and observed IRT estimates can be written as ^u = u + e. If u and e are independent, the variance of ^u is the sum of the true and error variances, σ²{^u} = σ²{u} + σ²{e}. The conditional variance of ^u given u is defined as σ²{^u | u} = σ²{e | u} = (TIF(u))^{-1}. The expected conditional variance of ^u for a specific distribution of u, φ(u | Ω), is as follows:

$$E\{\sigma^2\{\hat{u} \mid u\}\} = \int \sigma^2\{\hat{u} \mid u\}\, \varphi(u \mid \Omega)\, du. \tag{7}$$

The expected reliability of ^u is the well-known ratio of true to observed variance,

$$\rho_{\hat{u}\hat{u}} = \frac{\sigma^2}{\sigma^2 + E\{\sigma^2\{\hat{u} \mid u\}\}}, \tag{8}$$

where σ² is the variance of u specified in φ(u | Ω). Clearly, ρ_{^u^u} is dependent on the characteristics of the test (i.e., σ²{^u | u}) and the distribution of latent scores (i.e., φ(u | Ω)).
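Equations 7 and 8 reduce to a one-dimensional integral that can be approximated on a grid. A sketch under the assumption u ~ N(0, 1) with σ² = 1 (Python; the grid limits and step are arbitrary illustrative choices, and the tif helper condenses Equations 3 to 6):

```python
import numpy as np

def tif(u, a, b, c):
    """Test information at u (Equations 3-6) for Muraki's model."""
    total = 0.0
    for ai, bi in zip(a, b):
        Ps = np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-ai * (u - bi - np.asarray(c)))), [0.0]))
        d1 = ai * Ps * (1.0 - Ps)
        d2 = ai * d1 * (1.0 - 2.0 * Ps)
        P, dP, d2P = Ps[:-1] - Ps[1:], d1[:-1] - d1[1:], d2[:-1] - d2[1:]
        total += np.sum(dP**2 / P - d2P)
    return total

# Equation 7: integrate the error variance 1/TIF(u) against the standard
# normal density on a fine grid; Equation 8 then follows with sigma^2 = 1.
a, b, c = [0.5, 1.5, 2.5], [-1.0, 0.0, 1.0], [-1.6, 0.0, 1.6]
grid = np.linspace(-6.0, 6.0, 2001)
du = grid[1] - grid[0]
phi = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi)
err_var = np.sum(np.array([1.0 / tif(u, a, b, c) for u in grid]) * phi) * du
rho_irt = 1.0 / (1.0 + err_var)
```

The item parameters above are those of the three-item example used later in this article; any other parameter set or latent density φ(u | Ω) can be substituted in the same way.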



Reliability and Precision Within CTT Framework

Let x be the total score, or sum, of the I u_i, such that u_i = E{u_i | u} + e_i. For a subject with a given u, u_i equals one of J + 1 categories, each with probability P_{ij}(u_i = j | u), so E{u_i | u} is

$$E\{u_i \mid u\} = \sum_{j=1}^{J+1} j\, P_{ij}(u_i = j \mid u). \tag{9}$$

That is, E{u_i | u} is a weighted average of the category scores (i.e., j = 1 to J + 1), weighted by the chance that subjects with a specific u have an observed score u_i = j. The previous section discussed an expression for the expected conditional variance of ^u as a function of the inverse TIF and the distribution of u. This section analogously derives an expression for E{σ²{x | u}}, which is the expected variance of x for a given value of u. An equation for the reliability of x is presented as a function of E{σ²{x | u}} and the variance of the expected true scores, σ²{E{x | u}}. This subsection proceeds by first deriving an expression for the conditional error variance of x, E{σ²{x | u}}, and then identifies an equation for the true score variance, σ²{E{x | u}}.

Consider item i, where the error is e_i = u_i − E{u_i | u}. Note that u_i is a coarse measure of E{u_i | u} in that u_i is ordinal and equals one of the J + 1 values, whereas u is measured on an interval scale. One immediate observation is that E{e_i | u} = 0. Recall that u_i is a polytomous item, so e_i = j − E{u_i | u} for j = 1 to J + 1. The conditional expectation of the error within item i is

$$E\{e_i \mid u\} = \sum_{j=1}^{J+1} P_{ij}(u_i = j \mid u)\left(j - E\{u_i \mid u\}\right) = \sum_{j=1}^{J+1} j\, P_{ij}(u_i = j \mid u) - E\{u_i \mid u\} = 0. \tag{10}$$

The variability of observed u_i around expected values is determined by the error variance, or precision; that is, σ²{u_i | u} = E{e_i² | u}, which is defined as

$$\sigma^2\{u_i \mid u\} = E\left\{\left(u_i - E\{u_i \mid u\}\right)^2 \mid u\right\} = \sum_{j=1}^{J+1} P_{ij}(u_i = j \mid u)\left(j - E\{u_i \mid u\}\right)^2. \tag{11}$$

Equation 11 is new to the literature, and the following sections compare the properties of σ²{u_i | u} with σ²{^u | u}.
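Equations 9 and 11 can be evaluated numerically in a few lines. A minimal illustrative sketch (Python; Muraki's model supplies the category probabilities, and the function name is hypothetical):

```python
import numpy as np

def item_true_score_and_var(u, a, b, c):
    """E{u_i | u} (Equation 9) and sigma^2{u_i | u} (Equation 11)."""
    c = np.asarray(c, dtype=float)
    Ps = np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-a * (u - b - c))), [0.0]))
    P = Ps[:-1] - Ps[1:]                  # P(u_i = j | u) for j = 1, ..., J + 1
    j = np.arange(1, len(P) + 1)          # categories coded 1 to J + 1
    mean = float(np.sum(j * P))           # weighted average of category scores
    var = float(np.sum(P * (j - mean)**2))
    return mean, var
```

As u moves far above the top threshold, E{u_i | u} approaches J + 1 and the conditional variance collapses toward zero, consistent with the plateau of the item true score function.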

σ²{u_i | u} is the conditional variance for a single item, and an expression is needed for the conditional variance of x. An important observation is that errors within two polytomous items, say u_i and u_h (let categories for u_h be indexed by k), are independent whenever the items are locally independent, which assumes that P(u_i = j, u_h = k | u) = P_{ij}(u_i = j | u) P_{hk}(u_h = k | u). Specifically, the covariance between e_h and e_i conditioned on u is

$$\sigma\{e_i, e_h \mid u\} = E\left\{\left(u_i - E\{u_i \mid u\}\right)\left(u_h - E\{u_h \mid u\}\right) \mid u\right\}$$

$$= \sum_{j=1}^{J+1} \sum_{k=1}^{J+1} \left(j - E\{u_i \mid u\}\right)\left(k - E\{u_h \mid u\}\right) P(u_i = j, u_h = k \mid u)$$

$$= \sum_{j=1}^{J+1} \left(j - E\{u_i \mid u\}\right) P_{ij}(u_i = j \mid u) \sum_{k=1}^{J+1} \left(k - E\{u_h \mid u\}\right) P_{hk}(u_h = k \mid u) = 0. \tag{12}$$

The finding that errors are independent across items, if local independence is assumed, is particularly important because the conditional error variance of the total score x given u is simply the sum of the conditional item variances, σ²{u_i | u}.


Recall that x = Σ_{i=1}^{I} u_i. Equation 9 implies that the expected value of x given u is

$$E\{x \mid u\} = \sum_{i=1}^{I} E\{u_i \mid u\} = \sum_{i=1}^{I} \sum_{j=1}^{J+1} j\, P_{ij}(u_i = j \mid u). \tag{13}$$

The conditional variance of x given u is the sum of the conditional variances of the I u_i under the assumption of local independence (see Equation 12), which implies that σ²{x | u} = Σ_{i=1}^{I} σ²{u_i | u}. The expected conditional variance for a specific distribution of u is

$$E\{\sigma^2\{x \mid u\}\} = \int \sigma^2\{x \mid u\}\, \varphi(u \mid \Omega)\, du. \tag{14}$$

Recall that in IRT the expected variance of maximum likelihood estimates is the expected value of σ²{^u | u} across the distribution of u. Similarly, the expected conditional variance of x is found by replacing σ²{^u | u} with σ²{x | u}. Consequently, researchers can compare the CSEMs (i.e., σ{x | u} and σ{^u | u}) to understand which values of u are associated with relatively more precision within the CTT and IRT frameworks.

Recall that E{x | u} is the expected total score for subjects with a given u, and the variance of E{x | u} across subjects provides a measure of the amount of true score variance. First, note that the unconditional mean of x is

$$E\{x\} = E\{E\{x \mid u\}\} = \int E\{x \mid u\}\, \varphi(u \mid \Omega)\, du. \tag{15}$$

The variance of E{x | u} across u is

$$\sigma^2\{E\{x \mid u\}\} = E\left\{\left(E\{x \mid u\} - E\{x\}\right)^2\right\} = \int \left(E\{x \mid u\} - E\{x\}\right)^2 \varphi(u \mid \Omega)\, du. \tag{16}$$

The reliability of x is the ratio of true to observed variance:

$$\rho_{xx} = \frac{\sigma^2\{E\{x \mid u\}\}}{\sigma^2\{E\{x \mid u\}\} + E\{\sigma^2\{x \mid u\}\}}. \tag{17}$$
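Equations 13 through 17 can be assembled into one numerical routine. The sketch below is an illustrative Python version that assumes u ~ N(0, 1) and a simple grid integration (the article's general case uses the Fleishman density φ(u | Ω), and the function name is hypothetical):

```python
import numpy as np

def ctt_reliability(a, b, c, grid=None):
    """rho_xx (Equation 17) under u ~ N(0, 1), by grid integration."""
    if grid is None:
        grid = np.linspace(-6.0, 6.0, 2001)
    phi = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi)
    du = grid[1] - grid[0]
    Ex = np.zeros_like(grid)   # E{x | u}, Equation 13
    Vx = np.zeros_like(grid)   # sigma^2{x | u}: sum of item variances (Equation 12)
    for ai, bi in zip(a, b):
        for t, u in enumerate(grid):
            Ps = np.concatenate(([1.0], 1.0 / (1.0 + np.exp(-ai * (u - bi - np.asarray(c)))), [0.0]))
            P = Ps[:-1] - Ps[1:]
            j = np.arange(1, len(P) + 1)
            m = np.sum(j * P)
            Ex[t] += m
            Vx[t] += np.sum(P * (j - m)**2)
    err_var = np.sum(Vx * phi) * du                    # Equation 14
    mean_x = np.sum(Ex * phi) * du                     # Equation 15
    true_var = np.sum((Ex - mean_x)**2 * phi) * du     # Equation 16
    return true_var / (true_var + err_var)             # Equation 17
```

Running this with the three-item parameters used in the next section gives a ρ_xx that can be compared directly against the IRT reliability computed from Equations 7 and 8.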

Factors That Affect the Precision of CTT and IRT Scores

The previous section derived expressions for the CSEMs of x and ^u (i.e., σ{x | u} and σ{^u | u}). The characteristics of polytomous IRT CSEMs are well understood. For example, σ{^u | u} tends to be smaller at points along the latent continuum where category thresholds are located, and σ{^u | u} declines in regions where more discriminating items are located. In contrast, the expression for σ{x | u} is new, and the purpose of this section is to compare σ{x | u} with σ{^u | u} to provide researchers with a conceptual understanding of the testing situations where x may be preferred to ^u in terms of measurement precision.

As an example, consider a test consisting of three items that each have four response categories. Moreover, let a and b be three-dimensional vectors of item discriminations and locations, such that a = (0.5, 1.5, 2.5) and b = (−1, 0, 1). To simplify this example, let the category thresholds be equally spaced with values of (−1.6, 0, 1.6) units below and above u − b_i (i.e., the thresholds for the first item are located at −2.6, −1, and 0.6 on the u scale).

Figure 1 presents σ²{u_i | u} (see Equation 11) and [I_i(u)]^{-1} (see Equation 3) for the three items. The IRT conditional variances exhibit expected behavior. That is, [I_i(u)]^{-1} is lowest at the category thresholds, and [I_3(u)]^{-1} is generally smaller than for the other two items because the third item is more discriminating. One additional nuance is that [I_3(u)]^{-1} has local minima at the category thresholds, whereas Items 1 and 2 have inverse information functions that appear smoother. Stated differently, in IRT, items with larger discriminations tend to have IIFs with more topography in regions near category thresholds.

As expected, σ²{u_i | u} is smaller for more discriminating items; however, the general behavior of σ²{u_i | u} differs from that of [I_i(u)]^{-1}. Namely, σ²{u_i | u} tends to be a downward-facing function where measurement error is largest at category thresholds. For example, σ²{u_3 | u} tends to be the smallest of the three items, but σ²{u_3 | u} has maxima near the category thresholds, which differs from IRT, where measurements are more precise at category thresholds. Under a CTT framework, σ²{u_i | u} is smallest either for more extreme u or for u values that lie between category thresholds. In short, σ²{u_i | u} is roughly the mirror image of [I_i(u)]^{-1}, because [I_i(u)]^{-1} tends to be lower in segments of the u continuum where σ²{u_i | u} is larger, and vice versa.

Figure 1. Conditional variances of CTT and IRT scored items for a hypothetical three-item test with four response options and equally spaced thresholds, c_j = (−1.6, 0, 1.6). Note: CTT = classical test theory; IRT = item response theory.

Figure 2 includes the same items discussed in Figure 1, except that the thresholds are no longer equally spaced. Figure 2 shows that σ²{u_i | u} and [I_i(u)]^{-1} respond inversely to unequal item thresholds. For instance, σ²{u_i | u} tends to increase in portions of the latent continuum that include more item thresholds, whereas [I_i(u)]^{-1} is smaller in segments of the latent continuum where there are more item thresholds and increases wherever there are fewer thresholds.

Figure 2. Conditional variances of CTT and IRT scored items for a hypothetical three-item test with four response options and unequally spaced thresholds, c_j = (−1.6, 0, 1.0). Note: CTT = classical test theory; IRT = item response theory.

Figure 3 demonstrates CTT and IRT test CSEMs along with error bars around the conditional expected values, E{x | u} and E{^u | u}. The top row of Figure 3 illustrates the CTT and IRT CSEMs for x and ^u, respectively, for the hypothetical three-item test. Note that the vertical lines in the top row of panels in Figure 3 indicate category thresholds for Items 1, 2, and 3. Figure 3 shows that σ²{x | u} is larger for u values that are near category thresholds, whereas σ²{^u | u} is smaller at points on the latent continuum that have more thresholds and more discriminating items. For the three-item test, σ²{^u | u} appears more responsive to item discriminations than σ²{x | u}, as indicated by the fact that the slope of σ²{^u | u} is steeper than that of σ²{x | u} in the u range measured by Item 1.

The second row of panels in Figure 3 plots E{x | u} and E{^u | u}, as well as ±2 times the CSEMs. As discussed previously, E{x | u} is a more accurate indicator of u for more extreme values on the u continuum, whereas ^u is a better indicator in the range where items and categories are located. The overall reliability of x and ^u is dependent on the shape and location of the latent distribution. For example, if u ~ N(0, 1), most subjects lie in the middle range of the latent continuum, where x is less precisely measured relative to ^u. In fact, the differences in σ²{x | u} and σ²{^u | u} contribute to ^u being significantly more reliable than x (i.e., 0.65 vs. 0.49). Certainly, ρ_xx and ρ_{^u^u} will change depending on the shape and location of the u distribution.

Figure 3. Test score CTT and IRT conditional standard errors of measurement and expected value plots with ±2 error curves for a hypothetical three-item test with four response categories. Note: CTT = classical test theory; IRT = item response theory. Thresholds are indicated by dashed vertical lines for Item 1 (b_1 = −1, a_1 = 0.5), Item 2 (b_2 = 0, a_2 = 1.5), and Item 3 (b_3 = 1, a_3 = 2.5), with c_j = (−1.6, 0, 1.6). ρ_xx and ρ_{^u^u} were calculated under the assumption that u ~ N(0, 1).


Figure 4. Reliability of CTT and IRT scores of a 10-item test for different item locations, number of response categories, and subject latent distribution shapes. Note: CTT = classical test theory; IRT = item response theory.


The Reliability of x and ^u for Item and Subject Characteristics

Figure 3 included results for a simple example to demonstrate the theoretical differences between σ²{x | u} and σ²{^u | u}, which are useful pieces of information for developing tests and instruments. That is, σ²{x | u} and σ²{^u | u} provide applied researchers with an understanding of the ranges of u values that are best measured with x or ^u. Moreover, σ²{x | u} and σ²{^u | u} offer researchers information about which measurement framework is most beneficial for various item characteristics and subject populations.

Equations 8 and 17 were used to compare the reliability of x and ^u as a function of the number of scale categories, the purpose of measurement (i.e., item locations dispersed along the continuum or clustered in a given region), and the shape of the u distribution. More specifically, Figure 4 includes ρ_{^u^u} and ρ_xx across scale categories (i.e., 2–10 response options) for three types of item locations and four types of distributions for u. Item discriminations and test length were not manipulated and were fixed at 1.25 and 10, respectively. It is well known that increasing either item discrimination or test length increases reliability, and these parameters were held constant to focus on the other parameters.

To simplify the discussion, the three scenarios assume that items have equally spaced category thresholds. Specifically, the item category thresholds (i.e., the J c_j) were equally spaced between −2.0 and 2.0 on the u − b_i continuum. Let c be the vector of category thresholds, defined as c = 2(2J(J + 1)^{-1} − 1), where J is a vector with elements equal to the integers from 1 to J. For example, the threshold is zero for items with two scale categories (i.e., J + 1 = 2), and items with four scale categories have three thresholds at −1, 0, and 1. Whereas the following discussion assumes the thresholds are equally spaced, researchers can input any set of category thresholds into the R code, which is available at the author's website.

Let b be a vector of item locations and I be a vector that includes the integers from 1 to I. The three item location scenarios represent situations where researchers would be interested in measuring u values in a narrow range in the lower or upper tail or measuring u values across the latent continuum. The item locations for the three scenarios represent the following uniform distributions: b_i ~ U(−2.5, −1.5) (i.e., b = 0.5(2I(I + 1)^{-1} − 1) − 2), b_i ~ U(−2.0, 2.0) (i.e., b = 2(2I(I + 1)^{-1} − 1)), and b_i ~ U(1.5, 2.5) (i.e., b = 0.5(2I(I + 1)^{-1} − 1) + 2).
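The threshold and location vectors above are simple linear transformations of integer sequences. A sketch of their construction (Python; the scenario labels are illustrative, not the article's):

```python
import numpy as np

def thresholds(J):
    """Equally spaced thresholds between -2 and 2: c = 2(2J(J + 1)^-1 - 1)."""
    j = np.arange(1, J + 1)
    return 2.0 * (2.0 * j / (J + 1) - 1.0)

def locations(I, scenario):
    """Item locations for the three scenarios (lower tail, full range, upper tail)."""
    i = np.arange(1, I + 1)
    base = 2.0 * i / (I + 1) - 1.0               # equally spaced in (-1, 1)
    return {"lower": 0.5 * base - 2.0,           # approximately U(-2.5, -1.5)
            "full": 2.0 * base,                  # approximately U(-2.0, 2.0)
            "upper": 0.5 * base + 2.0}[scenario] # approximately U(1.5, 2.5)
```

For example, thresholds(3) returns the (−1, 0, 1) vector described above for items with four scale categories.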

As noted, four subject distributions were examined to evaluate how the density of the population at various points on the latent continuum affected ρ_{^u^u} and ρ_xx. Specifically, the distributions were negatively skewed (γ₃ = −1.5, γ₄ = 4.0), normal (γ₃ = 0, γ₄ = 0), symmetric and peaked (γ₃ = 0, γ₄ = 4.0), and positively skewed (γ₃ = 1.5, γ₄ = 4.0).

Figure 4 includes 12 panels corresponding to the three item locations and four subject distribution types. The middle row of panels in Figure 4 demonstrates that CTT and IRT yield similar reliabilities in situations where tests consist of items that are evenly placed along the latent continuum. Given that σ²{x | u} is smaller in areas with fewer category thresholds, one explanation for the slight advantage of IRT is that CTT CSEMs decline outside of the (−2, 2) range of the items, where there are fewer subjects in the skewed and symmetric distributions. More precisely, Figure 3 showed that σ²{^u | u} is significantly larger relative to σ²{x | u} in portions of the latent continuum that have fewer category thresholds. In contrast, σ²{x | u} is relatively smoother across u values and tends to decline in segments where there are fewer category thresholds. Furthermore, in the case where items are located across the latent continuum, ^u is understandably more reliable than x when the latent distribution is more peaked (i.e., positive kurtosis). The middle row of Figure 4 also shows the expected positive relationship between the number of scale categories and reliability for both ^u and x. However, as found in previous research (e.g., see reviews of the relevant literature in Chafouleas et al., 2009, and Weng, 2004), the marginal value of an additional response category decreases, and reliability does not increase significantly beyond four or five response categories.

In theory, x is expected to be more reliable than ^u when the latent distribution is mismatched with the item locations. The top and bottom panels of Figure 4 demonstrate the theoretical results discussed in this article concerning the effect that subject and item characteristics have on CTT and IRT reliability. For instance, CTT is superior to IRT whenever the items are located at the extremes of the latent distribution. Furthermore, the difference between ρ_xx and ρ_{^u^u} is largest for two-category items and approaches zero as the number of categories increases. As discussed earlier, IRT is a tool for measuring specific u values more precisely. In fact, the difference between ρ_xx and ρ_{^u^u} is smallest when the majority of the latent distribution overlaps with the item locations, which occurs when items are located in the lower portion of the latent continuum and the latent distribution is positively skewed (e.g., b_i ~ U(−2.5, −1.5), γ₃ = 1.5, and γ₄ = 4.0) or when items are in the upper tail and the distribution is negatively skewed (i.e., b_i ~ U(1.5, 2.5), γ₃ = −1.5, and γ₄ = 4.0).

In short, the results in this section provide new information about the influence of item and subject characteristics on the relative reliability of x and ^u. In fact, the findings reflect the expected results given the nature of σ²{x | u} and σ²{^u | u}. That is, total scores were more reliable whenever items were mismatched with the location of the latent u distribution. The following section addresses another important concern related to the accuracy of linear approximations of polytomous items.

Appropriateness of a Linear Approximation of Polytomous Items

One approach for modeling polytomous items is to use a linear approximation of the relationship between θ and u_i (e.g., a common factor model; Culpepper, 2012b). As noted in Equation 13, the relationship between θ and u_i is nonlinear. Ferrando (2009) and Mellenbergh (1996) noted that a linear relationship tends to provide a reasonable approximation when item discrimination indices are relatively low and there are five or more response categories. The purpose of this section is to explore the appropriateness of linear approximations of polytomous items in greater detail. This section includes two subsections. The first subsection describes a measure for quantifying the appropriateness of a linear approximation, whereas the second subsection presents findings about the accuracy of a linear approximation for different item characteristics (e.g., item locations, number of response categories, and item discriminations) and subject characteristics (e.g., the skewness and kurtosis of the latent distribution).

A Measure for the Appropriateness of a Linear Approximation

Let li(u) be the loading relating latent u to observed uifor a given u. The relationship between u and uifor a specific value of u is the first derivative of Equation 9 with respect to u:

lið Þ =u ∂E uf ijug

∂u = XJ + 1

j = 1

j∂Pijðui= jjuÞ

∂u : ð18Þ

The relationship between u_i and θ will be larger for some values of θ and near zero at the extremes of the θ continuum (i.e., portions of the latent continuum where E{u_i | θ} plateaus). A linear approximation is dependent on the characteristics of the examinee population. The expected loading for a given population of subjects can be found by averaging λ_i(θ) over the distribution of θ:

Culpepper 213


E\{\lambda_i(\theta)\} = E\left\{\frac{\partial E\{u_i \mid \theta\}}{\partial \theta}\right\} = \sum_{j=1}^{J+1} j \int \frac{\partial P_{ij}(u_i = j \mid \theta)}{\partial \theta}\, j(\theta \mid \Omega)\, d\theta. \qquad (19)

The function that constrains the relationship between θ and u_i to be linear for a given population of examinees is

L\{u_i \mid \theta\} = E\{u_i\} + E\{\lambda_i(\theta)\}(\theta - \mu), \qquad (20)

where E\{u_i\} = \int E\{u_i \mid \theta\}\, j(\theta \mid \Omega)\, d\theta, μ = E{θ}, and L{u_i | θ} denotes the linear approximation of the relationship between θ and u_i. Also, note that L{u_i | θ} has the same form as the least squares projection: L{u_i | θ} passes through the centroid (μ, E{u_i}) for θ and u_i, and its slope is the expected slope for a given population of examinees, E{λ_i(θ)}.
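Equations 18 through 20 are straightforward to compute numerically. The sketch below is a minimal illustration that assumes Samejima's graded response model for the category probabilities (the paper's own P_ij comes from its Equation 9, which may differ) and approximates the integrals over a standard normal latent distribution on a grid; the item parameters are hypothetical:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

def boundary_probs(theta, a, b, c):
    """P(u_i >= j | theta), j = 2..J+1, under an assumed graded response model."""
    z = a * (theta[:, None] - b - np.asarray(c)[None, :])
    return 1.0 / (1.0 + np.exp(-z))

def expected_score(theta, a, b, c):
    """E{u_i | theta} for categories scored 1..J+1."""
    return 1.0 + boundary_probs(theta, a, b, c).sum(axis=1)

def loading(theta, a, b, c):
    """lambda_i(theta) = dE{u_i | theta}/dtheta (Equation 18)."""
    pstar = boundary_probs(theta, a, b, c)
    return (a * pstar * (1.0 - pstar)).sum(axis=1)

# Grid approximation of the integrals over theta ~ N(0, 1)
theta = np.linspace(-6.0, 6.0, 2001)
w = norm.pdf(theta)

a_i, b_i, c_j = 2.0, 0.0, (-1.6, 0.0, 1.6)  # hypothetical four-category item
E_lam = trapezoid(loading(theta, a_i, b_i, c_j) * w, theta)       # Equation 19
E_u = trapezoid(expected_score(theta, a_i, b_i, c_j) * w, theta)  # = 2.5 here by symmetry
mu = trapezoid(theta * w, theta)                                  # = 0 for N(0, 1)

# Equation 20: the linear approximation passes through (mu, E{u_i})
# with slope equal to the expected loading.
L_approx = E_u + E_lam * (theta - mu)
print(round(E_u, 3), round(E_lam, 3))
```

Because the item, thresholds, and latent distribution are all symmetric in this example, E{u_i} lands at the scale midpoint; for off-center items the centroid and expected loading both shift.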

Let d_i = u_i − L{u_i | θ} be the error when predicting the observed item score u_i with a linear approximation. Unlike E{e_i | θ}, E{d_i | θ} ≠ 0:

E\{d_i \mid \theta\} = E\{u_i - L\{u_i \mid \theta\} \mid \theta\} = E\{u_i \mid \theta\} - L\{u_i \mid \theta\}. \qquad (21)

That is, d_i will be biased for certain values of θ, but E{E{d_i | θ}} = 0 across the θ range for a given population. The bias in d_i does not translate into larger conditional variances for L{u_i | θ}.

In fact, σ²{d_i | θ} = σ²{u_i | θ}:

\sigma^2\{d_i \mid \theta\} = E\left\{(d_i - E\{d_i \mid \theta\})^2 \mid \theta\right\} = E\left\{(u_i - E\{u_i \mid \theta\})^2 \mid \theta\right\} = \sigma^2\{u_i \mid \theta\}. \qquad (22)

Furthermore, the bias in d_i does not introduce dependence among residuals. Let d_h = u_h − L{u_h | θ} be the residual when linearly approximating u_h. The covariance between errors is

\sigma\{d_i, d_h \mid \theta\} = E\{(d_i - E\{d_i \mid \theta\})(d_h - E\{d_h \mid \theta\}) \mid \theta\} = E\{(u_i - E\{u_i \mid \theta\})(u_h - E\{u_h \mid \theta\}) \mid \theta\} = 0, \qquad (23)

where the last equality follows because the assumption of local independence implies E{e_i e_h | θ} = 0, as shown in Equation 12.
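These invariance results are easy to check by simulation: conditional on θ, the linear approximation is a constant, so subtracting it changes neither the conditional variance nor the conditional covariance. The sketch below draws graded responses for two hypothetical items at a fixed θ (the item parameters and the linear-approximation constants are illustrative assumptions, not values from this article):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_graded(theta, a, b, c, n, rng):
    """Draw n responses (1..J+1) at fixed theta via inverse CDF on the
    ordered boundary probabilities P(u >= j | theta)."""
    pstar = 1.0 / (1.0 + np.exp(-a * (theta - b - np.asarray(c))))
    return 1 + (rng.random((n, 1)) < pstar).sum(axis=1)

theta0, n = 0.5, 200_000
u_i = sample_graded(theta0, 2.0, 0.0, (-1.6, 0.0, 1.6), n, rng)
u_h = sample_graded(theta0, 1.0, 0.5, (-1.0, 1.0), n, rng)

# Given theta, L{u | theta} is just a number; these values are arbitrary.
d_i = u_i - (2.4 + 0.9 * theta0)
d_h = u_h - (2.0 + 0.5 * theta0)

print(np.isclose(np.var(d_i), np.var(u_i)))   # Equation 22: True (an exact shift)
print(abs(np.cov(d_i, d_h)[0, 1]) < 0.01)     # Equation 23: True (near zero)
```

The conditional covariance is only approximately zero in a finite sample, but it shrinks at the usual 1/sqrt(n) rate under local independence.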

Ferrando (2009) described a measure of the appropriateness of a linear model, which is based on restricting predicted scores to fall within the feasible range (e.g., 1 and J + 1 for a polytomous item with J + 1 categories). More precisely, 1 < L{u_i | θ} < J + 1 implies that a linear approximation is appropriate for θ scores within the following range:

\frac{1 - E\{u_i\}}{E\{\lambda_i(\theta)\}} + \mu < \theta < \frac{J + 1 - E\{u_i\}}{E\{\lambda_i(\theta)\}} + \mu. \qquad (24)

Following Ferrando, the floor and ceiling indices for the appropriateness of a linear approximation of u_i are d_0 = (1 − E{u_i})/E{λ_i(θ)} + μ and d_1 = (J + 1 − E{u_i})/E{λ_i(θ)} + μ, respectively. Ferrando's measure of appropriateness is the proportion of examinees with θ scores in the viable range:

P(d_0 < \theta < d_1) = \int_{d_0}^{d_1} j(\theta \mid \Omega)\, d\theta. \qquad (25)

Consider the examples presented in Figure 5, which compares E{u_i | θ} with L{u_i | θ} for three items with b_i = (−2, 0, 2) and the following parameter values held constant: a_i = 2 for all i, four response categories (J + 1 = 4), category thresholds of c = (−1.6, 0, 1.6), and Ω = (0, 1, 0, 0). The middle panel in Figure 5 includes the case where b_i = 0. Figure 5 shows that E{u_i | θ} and L{u_i | θ} deviate at the ends of the latent continuum. However, given that the distribution of θ is standard normal, nearly all of the subjects (i.e., 98.9%) have approximated scores between d_0 and d_1. The top and bottom rows of Figure 5 include examples of items that are located at −2 and 2, respectively. When items are located at the extremes of the θ continuum, P(d_0 < θ < d_1) = 0.876, which implies that 12.4% of subjects have approximated scores outside the viable bounds. The examples in Figure 5 demonstrate that item location and the shape of the latent distribution affect the appropriateness of linear approximations. For instance, Figure 5 provides insight that P(d_0 < θ < d_1) is larger in value if items are located near the mean of a standard normal distribution.
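Equations 24 and 25 can be evaluated for the Figure 5 setup. The sketch below again assumes a graded response model for the category probabilities (the model form is an assumption; the published values depend on this article's Equation 9) and reports the floor index, ceiling index, and proportion of viable scores for items located at −2, 0, and 2:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

def viable_proportion(a, b, c, n_cat):
    """d_0, d_1, and P(d_0 < theta < d_1) under theta ~ N(0, 1) (Equations 24-25)."""
    theta = np.linspace(-8.0, 8.0, 4001)
    w = norm.pdf(theta)
    pstar = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b - np.asarray(c)[None, :])))
    E_cond = 1.0 + pstar.sum(axis=1)               # E{u_i | theta}
    lam = (a * pstar * (1.0 - pstar)).sum(axis=1)  # Equation 18
    E_u = trapezoid(E_cond * w, theta)
    E_lam = trapezoid(lam * w, theta)              # Equation 19
    d0 = (1.0 - E_u) / E_lam                       # mu = 0 for N(0, 1)
    d1 = (n_cat - E_u) / E_lam
    return d0, d1, norm.cdf(d1) - norm.cdf(d0)     # Equation 25

props = {}
for b in (-2.0, 0.0, 2.0):
    d0, d1, p = viable_proportion(2.0, b, (-1.6, 0.0, 1.6), n_cat=4)
    props[b] = p
    print(f"b = {b:+.0f}: d0 = {d0:.2f}, d1 = {d1:.2f}, P = {p:.3f}")
```

Under these assumptions, the centered item yields the widest overlap with the standard normal density, and the computed proportions land near the 98.9% and 87.6% values reported for Figure 5; moving the item to ±2 shifts the interval (d_0, d_1) away from the bulk of the distribution.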

Impact of Item and Subject Characteristics on the Appropriateness of a Linear Approximation

The previous subsection discussed P(d_0 < θ < d_1) as a measure of the appropriateness of a linear model. The purpose of this subsection is to present more general findings about the role of item and subject characteristics on the accuracy of a linear approximation. Specifically, this subsection discusses the appropriateness of a linear approximation of u_i for 12 scenarios as defined by three response category scenarios (J + 1 = 2-10) and four θ distribution scenarios (i.e., the same distributions examined in Figure 4). In addition, item discrimination indices are compared for a_i = 0.5, 1.0, 1.5, and 2.0, and category thresholds are defined as discussed for the scenarios in Figure 4.

Figure 5. Hypothetical E{x | θ} and L{x | θ} for items with different locations and four response options.
Note: The minimum and maximum total scores are denoted on the y-axis by 1 and 4. Also, θ ~ N(0, 1), a_i = 2.0, and c_j = (−1.6, 0, 1.6).

Figures 6 and 7 plot P(d_0 < θ < d_1) against item locations (i.e., b_i) and provide evidence about circumstances where a linear approximation is appropriate. Note that the rows of Figures 6 and 7 correspond to the number of response categories (i.e., J + 1), whereas columns relate to distribution shape. Comparing rows demonstrates that P(d_0 < θ < d_1) increases as the number of response options increases; however, the accuracy of a linear approximation does not improve significantly beyond four response categories. For instance, a linear approximation is appropriate and P(d_0 < θ < d_1) > 0.90 for items with five or more response categories that are located within one standard deviation of the mean. Furthermore, larger item discriminations have a negative effect on P(d_0 < θ < d_1), and the number of response categories and item discrimination have an interactive effect: P(d_0 < θ < d_1) declines more as a_i increases when there are fewer response categories. Moreover, Figure 6 illustrates that a linear approximation is appropriate for as few as two response options when a_i = 0.5 or whenever a_i = 1.0 and items are located in the middle of the distribution.

Comparing columns provides an indication of the effect of latent distribution shape and item location on P(d_0 < θ < d_1). Specifically, the relationship between b_i and P(d_0 < θ < d_1) reflects the shape of the latent distribution, and P(d_0 < θ < d_1) is smallest when b_i is located in the tails of the latent distribution. For instance, the P(d_0 < θ < d_1) curves appear more bell shaped when the distribution is normal as opposed to peaked (i.e., γ_3 = 0, γ_4 = 4). In fact, P(d_0 < θ < d_1) is smaller when θ ~ N(0, 1), ceteris paribus. The relationship between item location and P(d_0 < θ < d_1) is cubic when the latent distribution is skewed, and P(d_0 < θ < d_1) is smaller in segments where there is less density in the latent distribution (e.g., the right tail if γ_3 = −1.5 and the left tail if γ_3 = 1.5).

Figure 6 also demonstrates the effect of item discrimination on P(d_0 < θ < d_1). Ferrando (2009) noted that a linear approximation is most appropriate when a_i is smaller. Ferrando's recommendations are supported here, given that a linear approximation is expected to be appropriate for almost all subjects and item locations when a_i = 0.50. Figures 6 and 7 also show conditions where linear approximations are appropriate even when items are more discriminating (e.g., a_i = 2). For example, P(d_0 < θ < d_1) is larger when J + 1 ≥ 3 and items are located near the middle of the latent distribution. P(d_0 < θ < d_1) declines as a_i increases for all of the scenarios included in Figure 6, albeit at different rates. In short, the results in Figures 6 and 7 imply that linear approximations are best for items that measure θ near the central portion of the latent distribution. Furthermore, L{u_i | θ} is least accurate for highly discriminating items, and a linear approximation is inaccurate when items are located in the tails of the latent distribution, even if there are more than four response categories.


The findings in this study offer guidance to applied researchers interested in the construction and analysis of polytomous items. In short, this article presented new theoretical results concerning item and subject characteristics that affect the relative precision and reliability of x and θ̂. This section summarizes the findings for psychometricians and applied researchers and offers concluding remarks.


Figure 6. The proportion of viable scores from a linear approximation of polytomous items with two to four response categories by item location and discrimination, and subject latent distribution shapes.
Note: Category thresholds were equally spaced along the latent continuum.


Figure 7. The proportion of viable scores from a linear approximation of polytomous items with five to seven response categories by item location and discrimination, and subject latent distribution shapes.
Note: Category thresholds were equally spaced along the latent continuum.


This article studied the reliability of two alternative test scoring approaches: a CTT total score versus an IRT θ̂ estimate. The majority of applied researchers in education and psychology are familiar with x, whereas fewer have knowledge about θ̂ and IRT. Consequently, it is important to offer applied researchers information about the relative merits of x and θ̂. This article offered additional insights into fundamental differences between x and θ̂. One salient factor studied in this article was the interactive effect that item locations and θ distribution shape had on ρ_xx and ρ_θ̂θ̂. Suppose the purpose of testing is to accept high-scoring examinees into an institution, as is the case with examinations for the actuarial sciences, where the latent distribution is either normal or positively skewed. For instance, the first actuarial exam includes items clustered in the upper portion of the θ continuum to identify those students who are most competent in the foundations of calculus and probability theory. The scores for examinees who are near the passing cutoff are measured more precisely than those in the middle or lower portions of the θ distribution, as indicated by smaller values of σ²{θ̂ | θ}. In contrast, the results in this article show that σ²{x | θ} is larger near the cutoff score and relatively smaller for θ values in the lower and middle portions of the examinee distribution. Certainly, decision makers would prefer IRT scoring over a total score in this instance, because the purpose of measurement is to evaluate whether test takers exceed a minimum proficiency level. However, secondary users of the actuarial test score data should prefer total scores because, as indicated in Figure 4, x tends to be more reliable for a population of subjects than θ̂ if items are difficult and located in the tails of the test-taker population distribution.
For instance, some secondary users may want to gather validity evidence and correlate examinee scores with other indicators, such as undergraduate or graduate grade point average (Culpepper, 2010; Culpepper & Davenport, 2009), other aptitude tests, or job performance (Aguinis, Culpepper, & Pierce, 2010). In these instances, the validity coefficient associated with θ̂ would be smaller than the coefficient for x because ρ_xx > ρ_θ̂θ̂.

In addition, the results provide evidence that CTT and IRT scoring methods perform similarly when the goal of testing is to measure a construct across values of a latent continuum. For example, federal and state testing programs measure what students know in relation to given standards. The results in Figure 4 imply that ρ_xx and ρ_θ̂θ̂ are similar, so testing programs could report total scores, which may be easier to explain to certain stakeholders (e.g., teachers, parents, and students).

The results in this article also offer recommendations for the construction of scales that include polytomous items, such as educational or employment performance assessments, behavioral ratings, or affective measurements. In either case, researchers may prefer x to θ̂ if the purpose of instrument development is to conduct correlational research rather than to measure specific trait levels. However, x should probably be preferred to θ̂ only when researchers fit simple models rather than more complicated interactive models (Embretson, 1996; Kang & Waller, 2005; Morse et al., 2012).

In addition, the findings in this article provide information about the optimal number of item response options. The results in Figure 4 examined the reliability of total scores and IRT estimates across a range of parameter values for the number of scale categories, item locations, and the shape of the latent distribution. The findings in Figure 4 imply that using more than five or six scale categories does not significantly improve the reliability of x or θ̂, regardless of the shape of the latent distribution or the location of items. However, it is important to note that adding an additional scale value had a larger effect on ρ_θ̂θ̂ than on ρ_xx.

Another relevant finding for applied researchers (and methodologists) relates to the appropriateness of a linear approximation of polytomous items. The results in this article confirm arguments in previous research (Ferrando, 2009) that linear approximations are more accurate for less discriminating items and items with more response categories. Additional findings suggest that linear approximations of polytomous items are appropriate for items that measure trait levels in denser segments of the latent distribution, and linear approximations were least appropriate for items located in the tails of the θ distribution. In contrast to previous research, the results provided new evidence that researchers need to consider item locations, in addition to item discriminations and the number of response categories, when employing linear approximations.

Last, another contribution of this study is the availability of the associated R code. More specifically, researchers can use the R code when designing instruments to evaluate conditions where total scores and IRT scores are more or less reliable and precise. Furthermore, the R code has pedagogical value for computational applications of the theoretical results.

There are several directions for future research to build on this study. First, this article offers recommendations for researchers who are interested in using x as a measure of θ in applied research. Specifically, designing a reliable x requires the inclusion of items located at the boundaries of the θ range of interest. For instance, if the measurement goal is to distinguish high versus low scorers on some trait, the results pertaining to CTT CSEMs dictate that the items should be located in the middle of the distribution, because low and high scorers will then be measured more precisely. Likewise, items should be located around a cut score (and not at the cut score) if the purpose of measurement is to make inferences about whether examinees exceed or fall below some minimum proficiency level. The behavior of CSEMs under CTT is counterintuitive because, in contrast, item design under the IRT framework dictates that developers should write items that are specific to certain θ levels and measurement purposes. Additional research is needed to understand differences in optimal test assembly (H. Chang & Ying, 2009) within the IRT and CTT frameworks.

Second, there could be benefits in revisiting topics from modern IRT within the context of CTT. For instance, reexamining topics such as computer adaptive testing (H. Chang & Ying, 1996) or equating techniques (Kolen & Brennan, 2004) from a CTT perspective could lead to new methodologies, refinements of existing approaches, or other unanticipated discoveries.

Third, researchers could extend the results in this article to understand the impact of using x and θ̂. Specifically, researchers use total scores as dependent variables and predictors in every subdiscipline of psychology and education. Despite the widespread use of total scores, few methodological studies have examined the impact of using total scores on the power and Type I error rates of the tests that researchers employ (Embretson, 1996; Kang & Waller, 2005; Morse et al., 2012). Furthermore, with the exception of Embretson (1996), previous studies utilized Monte Carlo techniques that are limited by the parameter values studied. Consequently, future analytic explorations could provide additional insights into the effect of using total scores, and future research should accordingly examine the effect that the number of scale categories, the shape of the latent distribution, and IRT parameters have on the performance of commonly used statistical tests (Culpepper, 2012a; Culpepper & Aguinis, 2011).

Fourth, this study examined the theoretical reliability of x and θ̂ using a polytomous IRT model. As one anonymous reviewer noted, this article addressed reliability from a mathematical perspective and did not consider factors related to subjects' cognitive decision making. For example, this article did not address issues related to category labels; however, previous research identified a causal effect of scale labels, category position, and rating scale intensity and length on certain observed item characteristics (Dunham & Davison, 1991; Lam & Stevens, 1994; Murphy & Constans, 1987). For instance, existing evidence suggests that scale labels can affect observed item means, but there is less evidence that manipulating category labels alters observed item variances (L. Chang, 1997; Dunham & Davison, 1991). Moreover, positively packed scales tend to affect item means (Dunham & Davison, 1991; Lam & Kolic, 2008), and semantic compatibility of category labels improves reliability (Lam & Kolic, 2008).

Most of the previous literature on category labels used CTT or generalizability theory, and additional empirical research is needed to understand how category labeling decisions affect polytomous IRT item parameters (e.g., item locations, discriminations, and category thresholds) and consequently alter the reliability of x and θ̂. Future research may identify relationships between category labels, cognitive decision making, and IRT parameters, and the results in this article provide mathematical arguments for describing how subsequent changes in IRT parameters affect test score reliability.

In conclusion, this study presented new results concerning the relative reliability and precision of total scores and IRT scores. The derivations in this article offer the most extensive analysis of the reliability of total scores to date by linking parameters of polytomous IRT models with CTT. In addition, new equations describing the CTT CSEM were discussed to provide new conceptual understanding of differences in the precision of scores estimated within the CTT and IRT frameworks.


Fleishman's Power Transformation (PT) Method Probability Density Function (PDF)

Fleishman (1978) developed a PT method for generating nonnormal univariate random variables. The PT method uses the following function to transform a standard normal random variable, y, into a metric with a given γ_3 and γ_4:

\theta = f(y \mid v) = \sum_{r=1}^{4} v_r\, y^{r-1}, \qquad (A1)

where the vector of Fleishman coefficients, v = (v_1, v_2, v_3, v_4), is identified so that θ has a predetermined mean, variance, skewness, and kurtosis. Additional research extended Fleishman's method to higher order PTs (Headrick, 2002) and multivariate circumstances (Headrick & Sawilowsky, 1999; Lyhagen, 2008; Vale & Maurelli, 1983).
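As a concrete illustration, the sketch below applies Equation A1 to draw a symmetric, heavy-tailed latent distribution. The specific coefficient values are illustrative: setting v_3 = 0 removes the skew, and v_2 is solved from the standard Fleishman variance condition v_2² + 6v_2v_4 + 15v_4² = 1 so that the generated trait has unit variance:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def fleishman_transform(y, v):
    """Equation A1: theta = v1 + v2*y + v3*y**2 + v4*y**3."""
    v1, v2, v3, v4 = v
    return v1 + v2 * y + v3 * y**2 + v4 * y**3

v4 = 0.05                                             # illustrative cubic weight
# Solve v2 from v2^2 + 6*v2*v4 + 15*v4^2 = 1 (positive root of the quadratic)
v2 = (-6 * v4 + np.sqrt(36 * v4**2 - 4 * (15 * v4**2 - 1))) / 2
v = (0.0, v2, 0.0, v4)

rng = np.random.default_rng(7)
theta = fleishman_transform(rng.standard_normal(1_000_000), v)

print(round(theta.var(), 2))   # ~1.0 by construction
print(round(skew(theta), 2))   # ~0.0 because v3 = 0
print(round(kurtosis(theta), 2))  # positive excess kurtosis (a peaked distribution)
```

This symmetric leptokurtic case resembles the "peaked" distributions examined in the article; skewed cases additionally require a nonzero v_3 and a numerical solve of the full moment system.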

One critique of Fleishman's method was that there was no known probability distribution for variables generated using Fleishman's technique (Tadikamalla, 1980). Recent research derived the PDF for Fleishman's PT method (Headrick & Kowalchuk, 2007) and showed that it is

j(\theta \mid \Omega) = \frac{\phi(f^{-1}(\theta \mid v))}{\sigma\, f'(f^{-1}(\theta \mid v) \mid v)}, \qquad (A2)

where φ(·) is the standard normal density, f^{-1}(θ | v) is the inverse of Equation A1 that specifies values of y as a function of θ, and f'(y | v) is the derivative of Equation A1 with respect to y. An expression for f^{-1}(θ | v) can be found using Cardano's formula for the inverse of a cubic polynomial. The only real root of f^{-1}(θ | v) is

y = f^{-1}(\theta \mid v) = \left(q + \sqrt{q^2 + (r - p^2)^3}\right)^{1/3} - \left|q - \sqrt{q^2 + (r - p^2)^3}\right|^{1/3} + p, \qquad (A3)

where the second cube root is taken of the absolute value of q - \sqrt{q^2 + (r - p^2)^3}. Furthermore, p, q, and r are functions of v and θ, as shown in the following:



p = -\frac{v_3}{3 v_4}, \qquad q = p^3 + \frac{v_2 v_3 - 3 v_4\left(v_1 - \frac{\theta - \mu}{\sigma}\right)}{6 v_4^2}, \qquad r = \frac{v_2}{3 v_4}, \qquad (A4)

where μ and σ denote the mean and standard deviation of θ.

This article uses j(θ | Ω) to understand the impact of nonnormal latent distributions on the reliability of x and θ̂. Specifically, v is first identified for a given γ_3 and γ_4; then p, q, and r in Equation A4 are computed so that f^{-1}(θ | v) can be evaluated with Equation A3.
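Equations A2 through A4 can be implemented directly. The sketch below codes the Cardano inversion and the resulting density for hardcoded symmetric, unit-variance coefficients (μ = 0 and σ = 1 are assumed defaults, and v_4 must be nonzero for p, q, and r to be defined), then checks that the inversion recovers y and that the density integrates to 1:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

def fleishman_inverse(theta, v, mu=0.0, sigma=1.0):
    """Equations A3-A4: the real root y = f^{-1}(theta | v); requires v4 != 0
    and a monotonically increasing cubic."""
    v1, v2, v3, v4 = v
    p = -v3 / (3.0 * v4)
    r = v2 / (3.0 * v4)
    q = p**3 + (v2 * v3 - 3.0 * v4 * (v1 - (theta - mu) / sigma)) / (6.0 * v4**2)
    s = np.sqrt(q**2 + (r - p**2) ** 3)
    # np.cbrt takes real cube roots, so cbrt(q - s) = -|q - s|**(1/3) when q < s
    return np.cbrt(q + s) + np.cbrt(q - s) + p

def fleishman_pdf(theta, v, mu=0.0, sigma=1.0):
    """Equation A2: j(theta | Omega) = phi(f^{-1}) / (sigma * f'(f^{-1}))."""
    v1, v2, v3, v4 = v
    y = fleishman_inverse(theta, v, mu, sigma)
    fprime = v2 + 2.0 * v3 * y + 3.0 * v4 * y**2
    return norm.pdf(y) / (sigma * fprime)

v = (0.0, 0.84247, 0.0, 0.05)   # illustrative symmetric, unit-variance coefficients
y = np.linspace(-4.0, 4.0, 9)
theta_of_y = v[1] * y + v[3] * y**3
print(np.allclose(fleishman_inverse(theta_of_y, v), y))   # round-trip: True

grid = np.linspace(-8.0, 8.0, 4001)
print(round(trapezoid(fleishman_pdf(grid, v), grid), 3))  # integrates to ~1
```

The round-trip check is a useful guard when experimenting with coefficients, because the closed-form root is only meaningful when the cubic is strictly increasing.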

It is important to note that Fleishman's PT method yields a valid PDF only when θ is a monotonically increasing function of y (i.e., f'(y | v) must be positive for all values of y). Headrick and Kowalchuk (2007) proved that the PT method produces a valid PDF if the Fleishman coefficients satisfy the following constraints:

v_4 > \frac{1}{5}\sqrt{\frac{5 + 7 v_2^2}{3}} - \frac{2}{5} v_2, \qquad 0 < v_2 < 1.

Consequently, Fleishman's PT method can only be used to study the impact of nonnormal latent distributions when |γ_3| < 4.5.
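A small helper makes the Headrick-Kowalchuk condition easy to apply when choosing coefficients. Note that the boundary case v_2 = 1 with v_4 = 0 (the standard normal) sits exactly on the constraint, which is a useful sanity check; the coefficient values in the calls below are illustrative:

```python
from math import sqrt

def valid_fleishman_pdf(v2, v4):
    """Check the Headrick-Kowalchuk condition for a valid Fleishman PDF,
    given 0 < v2 < 1 (v2 = 1 with v4 = 0 is the normal boundary case)."""
    if not (0.0 < v2 < 1.0):
        return False
    return v4 > sqrt((5.0 + 7.0 * v2**2) / 3.0) / 5.0 - 2.0 * v2 / 5.0

print(valid_fleishman_pdf(0.84247, 0.05))  # True
print(valid_fleishman_pdf(0.5, 0.0))       # False: this v2 needs v4 > 0.1
```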

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publi- cation of this article.


Funding

The author received no financial support for the research, authorship, and/or publication of this article.


References

Adelson, J. L., & McCoach, D. B. (2010). Measuring the mathematical attitudes of elementary students: The effects of a 4-point or 5-point Likert-type scale. Educational and Psychological Measurement, 70, 796-807.

Aguinis, H., Culpepper, S. A., & Pierce, C. A. (2010). Revival of test bias research in preemployment testing. Journal of Applied Psychology, 95, 648-680.

Aguinis, H., Pierce, C. A., & Culpepper, S. A. (2009). Scale coarseness as a methodological artifact: Correcting correlation coefficients attenuated from using coarse scales. Organizational Research Methods, 12, 623-652.

Andrich, D. (1978a). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.

Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.

Bandalos, D. L., & Enders, C. K. (1996). The effects of nonnormality and number of response categories on reliability. Applied Measurement in Education, 9, 151-160.

Bechger, T. M., Maris, G., Verstralen, H. H., & Beguin, A. A. (2003). Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27, 319-334.




Related subjects :