Article

Applied Psychological Measurement 37(3), 201–225
© The Author(s) 2013
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621612470210
apm.sagepub.com

### The Reliability and Precision of Total Scores and IRT Estimates as a Function of Polytomous IRT Parameters and Latent Trait Distribution

Steven Andrew Culpepper^{1}

Abstract

A classic topic in the fields of psychometrics and measurement has been the impact of the number of scale categories on test score reliability. This study builds on previous research by further articulating the relationship between item response theory (IRT) and classical test theory (CTT). Equations are presented for comparing the reliability and precision of scores within the CTT and IRT frameworks. This study presents new results pertaining to the relative precision (i.e., the test score conditional standard error of measurement for a given trait value) of CTT and IRT, and the new results shed light on the conditions where total scores and IRT estimates are more or less precisely measured. The relative reliability of CTT and IRT scores is examined as a function of item characteristics (e.g., locations, category thresholds, and discriminations) and subject characteristics (e.g., the skewness and kurtosis of the latent distribution). CTT total scores were more reliable when the latent distribution was mismatched with category thresholds, but the discrepancy between CTT and IRT declined as the number of scale categories increased. This article also considers the appropriateness of linear approximations of polytomous items and presents circumstances where linear approximations are viable. A linear approximation may be appropriate for items with two response options depending on the item discrimination and the match between the item location and latent distribution. However, linear approximations are biased whenever items are located in the tails of the latent distribution, and the bias is larger for more discriminating items.

Keywords

reliability, scale construction, classical test theory, item response theory, polytomous items, information function

1University of Illinois at Urbana–Champaign, USA

Corresponding Author:
Steven Andrew Culpepper, University of Illinois at Urbana–Champaign, 116D Illini Hall, MC-374, 725 South Wright Street, Champaign, IL 61820, USA.
Email: sculpepp@illinois.edu

The impact of the number of scale categories on reliability is a classic topic in psychology (Symonds, 1924), and an extensive body of research has examined the effect of the number of scale categories on the corresponding reliability of total scores (x). In fact, researchers have examined the impact of the number of scale categories on total score reliability using empirical data (Adelson & McCoach, 2010; Bendig, 1954; Chafouleas, Christ, & Riley-Tillman, 2009; L. Chang, 1994; Komorita & Graham, 1965; Matell & Jacoby, 1971; Weng, 2004), Monte Carlo simulations (Aguinis, Pierce, & Culpepper, 2009; Bandalos & Enders, 1996; Cicchetti, Showalter, & Tyrer, 1985; Enders & Bandalos, 1999; Greer, Dunlap, Hunter, & Berman, 2006; Jenkins & Taber, 1977; Lissitz & Green, 1975), and analytic derivations (Krieg, 1999).

Methodological developments have bridged the concept of reliability between the item response theory (IRT) and classical test theory (CTT) frameworks and discussed concepts that have traditionally been reserved for IRT (e.g., item information functions [IIFs]) within the context of CTT. The purpose of this article is to understand the circumstances under which researchers should prefer estimating true scores (i.e., u) with test scores derived from IRT (i.e., ^u) versus CTT (i.e., x). This article compares the precision and reliability of ^u and x, and equations are presented for investigating how item and subject characteristics affect the reliability of ^u and x.

Mellenbergh (1996) noted that reliability is a population-dependent quantity that is affected by characteristics of latent distributions, whereas the conditional standard error of measurement (CSEM) quantifies error variance (i.e., precision or the inverse of information) for a specific u value. New equations are presented that compare the relative precision of ^u and x. The results show that IRT and CTT have CSEMs that are roughly the mirror image across values of u. That is, ^u is measured more precisely in portions of the u continuum that include relatively more category thresholds, whereas x tends to be measured more precisely for u values that are further from category thresholds.

It is important to articulate the contributions of this article to existing research. First, methodological advances concerning dichotomous items established a link between the IRT and CTT frameworks by articulating the reliability of total scores and percentile ranks with corresponding IRT item parameters, such as item difficulty, discrimination, and guessing parameters (Bechger, Maris, Verstralen, & Beguin, 2003; Dimitrov, 2003; Kolen, Zeng, & Hanson, 1996; May & Nicewander, 1994). Additional research has studied the impact of IRT item parameters on the reliability of gain scores (May & Jackson, 2005), the lower and upper bounds of the IRT reliability coefficient (Kim & Feldt, 2010), the reliability of scale scores and performance assessments using polytomous IRT (Wang, Kolen, & Harris, 2000), and the reliability of subscores using unidimensional (Haberman, 2008) and multidimensional IRT (Haberman & Sinharay, 2010). Researchers have also found that IRT scores provide more accurate estimates of interaction effects within the contexts of analysis of variance (Embretson, 1996) and multiple regression (Kang & Waller, 2005; Morse, Johanson, & Griffeth, 2012). This article offers new information about the connection between the concepts of reliability and precision in CTT and IRT. That is, no study has provided theoretical results about the relative precision of ^u and x for a given u value. The new derivations provide a theoretical rationale for circumstances when CTT scores include relatively more or less measurement error than IRT scores. Moreover, no research study has analytically studied the interactive effect of item characteristics and latent distribution shape on the relative reliability of ^u and x. This article uses Fleishman's (1978) power transformation (PT) method probability density function to study the impact of nonnormal latent distributions on the reliability of total scores.

Second, additional research has extended the concept of item and test information to CTT under the assumptions that observed measurements are continuous, rather than polytomous, and u is linearly related to x (Ferrando, 2002, 2009; McDonald, 1982; Mellenbergh, 1996).

Ferrando notes that a linear model provides a good approximation when item discrimination indices are relatively small in value and coarsely measured items have five or more response categories. Moreover, Ferrando and Mellenbergh showed that, for the linear model, the CTT IIF is horizontal and unrelated to u. However, whenever the relationship between latent and observed total scores is nonlinear (which occurs when items are polytomous), the CTT IIFs are no longer unrelated to u. In fact, this article shows that the conditional standard error of x given u is a downward-facing function where scores near category thresholds have the least amount of precision. The accuracy of a linear relationship is also explored, and the results in this article examine the effect of test characteristics (e.g., item locations and discrimination) and subject characteristics (e.g., latent distribution shape) on the appropriateness of linear approximations of polytomous items.

Third, previous Monte Carlo studies that have examined the relative performance of CTT and IRT estimates are limited by the combination of parameter values and type of reliability studied. For instance, Greer et al. (2006) studied the impact of skewness on coefficient alpha with the constraint that item variances were equal, which may not occur frequently in practice. The results in this article can be used to study any combination of IRT parameter values and latent distribution shape and provide more general results than previous Monte Carlo simulations. In addition, Wang et al. (2000) presented equations for computing the reliability of scale scores from performance assessments using the generalized partial credit model. However, unlike Wang et al., this study presents equations for evaluating how the number of items, number of scale categories, and the shape of the latent distribution affect the reliability of ^u and x.

Furthermore, R code (R Development Core Team, 2010) is available at http://publish.illinois.edu/sculpepper/, and researchers can use the R code to compute the reliability of ^u and x for different item characteristics and latent distributions. Consequently, this study presents new results and provides applied researchers with guidance for reliably scoring tests in different situations.

This article includes five sections. The first section presents equations for the reliability and precision of scores within the CTT and IRT paradigms and includes new results about the CSEM for CTT. The second section compares IRT and CTT in terms of CSEMs to provide a general understanding of the circumstances under which researchers should prefer CTT or IRT scores.

The third section compares the relative reliability of ^u and x for different item characteristics (e.g., item locations and number of response categories) and subject distribution characteristics (e.g., the skewness and kurtosis of u), and the fourth section examines how item and subject characteristics affect the appropriateness of a linear approximation of polytomous items. The last section discusses the results and provides recommendations and concluding remarks.

Equations for the Reliability of CTT and IRT Estimates of u

Let u represent a latent variable and u_i be an observed polytomous response, where i indexes items (i = 1, . . . , I). The observed polytomous response for item i can be expressed as a function of a true score, E{u_i|u}, and random error, e_i, such that u_i = E{u_i|u} + e_i. That is, the observed u_i equals an item true score E{u_i|u} (May & Nicewander, 1994), which is a nonlinear function of u, plus an error, e_i. Let j index category thresholds (j = 1, . . . , J), so that J + 1 is the corresponding number of categories for item i. That is, j is used to index categories as well (e.g., J + 1 = 4 implies that u_i has four categories). This article assumes that researchers code u_i using integers from 1 to J + 1. Several polytomous models describe the relationship between u and the chance that u_i equals one of the J + 1 categories, including the graded response (Molenaar, Dolan, & De Boeck, 2012; Muraki, 1990; Samejima, 1969), partial credit (Masters, 1982; Muraki, 1992, 1993), and rating scale (Andrich, 1978a, 1978b) models. The derivations in this article use Muraki's (1990) modified graded response model,


$$P^*_{ij}(u_i \ge j \mid u) = \begin{cases} 1, & j = 0 \\ \left[1 + \exp\left(-a_i\left(u - b_i - c_j\right)\right)\right]^{-1}, & 0 < j \le J \\ 0, & j = J + 1, \end{cases} \quad (1)$$

where b_i and a_i are the item difficulty and discrimination parameters, respectively, and c_j is the jth threshold, which is equal for all I items. Equation 1 represents the chance that u_i ≥ j given u, and the probability that item i equals the jth category is

$$P_{ij}(u_i = j \mid u) = P^*_{ij}(u_i \ge j \mid u) - P^*_{i,j+1}(u_i \ge j + 1 \mid u). \quad (2)$$

Note that Muraki's model was chosen because the form of Equation 1 is easier to manipulate analytically and the derivations of expressions for derivatives are less cumbersome.

Muraki's model assumes that the J c_j are constant across items and are equally spaced. The goal of this article is not to estimate abilities or item parameters with Muraki's model, and these assumptions can be relaxed to evaluate the impact of unequal item thresholds on the reliability and precision of x and ^u. In fact, the new expressions are applicable for any polytomous model, and Muraki's model is only used for computational examples. Furthermore, the aforementioned models tend to yield scores that are highly correlated (Embretson & Reise, 2000), so we should not anticipate the results would change significantly if the partial credit or rating scale models were used.
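To make Equations 1 and 2 concrete, the category probabilities can be computed directly. The sketch below is in Python rather than the article's posted R code, uses the illustrative parameterization P*_ij = [1 + exp(−a_i(u − b_i − c_j))]⁻¹, and all parameter values and function names are hypothetical:

```python
import math

def p_star(theta, a, b, c):
    """P*_ij(u_i >= j | u): chance of responding in category j or above (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b - c)))

def category_probs(theta, a, b, thresholds):
    """P_ij(u_i = j | u) for j = 1, ..., J + 1 via adjacent differences (Equation 2)."""
    # Boundary cases from Equation 1: P* = 1 below the first threshold
    # and P* = 0 above the last.
    p = [1.0] + [p_star(theta, a, b, c) for c in thresholds] + [0.0]
    return [p[j] - p[j + 1] for j in range(len(p) - 1)]

# A four-category item (J = 3 thresholds) evaluated at u = 0.
probs = category_probs(theta=0.0, a=1.5, b=0.0, thresholds=(-1.6, 0.0, 1.6))
print(probs)  # four category probabilities that sum to 1
```

Because the cumulative probabilities telescope, the J + 1 category probabilities always sum to one for any admissible parameter values.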

The derivations below require the specification of a distribution for u. In this article, u is assumed to follow a Fleishman PT distribution, φ(u|Ω), where Ω = (μ, σ², κ₃, κ₄) indicates that u has a mean μ, variance σ², and skewness and kurtosis of κ₃ and κ₄, respectively (φ(u|Ω) is discussed in greater detail in the Appendix). Note that any univariate distribution could be chosen for u. One advantage of using Fleishman's PT distribution is that it is flexible enough to explore how changes in μ, σ², κ₃, and κ₄ affect the relative reliability of IRT and CTT u estimates. However, as noted by an anonymous reviewer, researchers often set σ² = 1 to estimate item discriminations (i.e., the scale indeterminacy problem). Accordingly, the results in this article also use σ² = 1 to understand how manipulating item discriminations affects reliability. Fleishman's PT distribution does not encompass the universe of all univariate distributions, so future researchers can modify the associated R code to examine the reliability of CTT and IRT estimates when u follows other distributions.

Furthermore, the discussion below provides an argument as to how reliability within the CTT and IRT frameworks depends on the match between the shape of φ(u|Ω) and the topography of the conditional variances of x and ^u.
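For reference, Fleishman's PT represents the latent variable as a cubic polynomial in a standard normal deviate, u = μ + σ(−c + bZ + cZ² + dZ³), with (b, c, d) chosen to satisfy Fleishman's (1978) moment equations for the target skewness and kurtosis. A minimal Python sketch of those moment conditions and the generation step follows; the function and variable names are mine, not the article's R code:

```python
import random

def fleishman_residual(coef, skew, kurt):
    """Fleishman's (1978) moment conditions for u = -c + b*Z + c*Z**2 + d*Z**3.
    All three residuals are zero when (b, c, d) yield unit variance and the
    target skewness and excess kurtosis."""
    b, c, d = coef
    return [
        b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,                      # variance condition
        2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,                # skewness condition
        24*(b*d + c**2*(1 + b**2 + 28*b*d)
            + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - kurt,   # kurtosis condition
    ]

def draw_pt(coef, mu=0.0, sigma=1.0, rng=random):
    """Draw one value from the PT distribution phi(u | Omega)."""
    b, c, d = coef
    z = rng.gauss(0.0, 1.0)
    return mu + sigma * (-c + b*z + c*z*z + d*z**3)

# (b, c, d) = (1, 0, 0) recovers the standard normal latent distribution.
print(fleishman_residual((1.0, 0.0, 0.0), skew=0.0, kurt=0.0))  # [0.0, 0.0, 0.0]
```

Coefficients for nonnormal cases (e.g., γ₃ = 1.5, γ₄ = 4.0) are obtained by driving the three residuals to zero with any standard root finder.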

Reliability and Precision Within IRT Framework

One of the strengths of IRT over CTT relates to the well-known measure of precision for estimated trait scores, ^u. Reliability is specific to a group and is a function of the unconditional standard error of measurement (SEM). CTT has traditionally calculated the SEM for a group of scores, whereas the CSEM is a measure of precision that corresponds to a specific trait level within the IRT framework. The CSEM of ^u is related to the test information function (TIF), which is derived using the concept of Fisher's information to measure the amount of information that a single observation provides about u. In fact, the inverse of a TIF indicates the variance of ^u for a given u. Previous research (Muraki, 1993; Samejima, 1994) discussed TIFs for polytomous IRT models and noted that the IIF for item i is

$$I_i(u) = -\sum_{j=1}^{J+1} P_{ij}(u_i = j \mid u)\,\frac{\partial^2}{\partial u^2}\ln P_{ij}(u_i = j \mid u) = \sum_{j=1}^{J+1}\left[\frac{\left[\partial P_{ij}(u_i = j \mid u)/\partial u\right]^2}{P_{ij}(u_i = j \mid u)} - \frac{\partial^2 P_{ij}(u_i = j \mid u)}{\partial u^2}\right]. \quad (3)$$

For Muraki's (1990) modified graded response model, P_ij(u_i = j|u) is a function of P*_ij(u_i ≥ j|u), which is a logistic function. The first two derivatives of P*_ij(u_i ≥ j|u) are as follows:

$$\frac{\partial P^*_{ij}(u_i \ge j \mid u)}{\partial u} = a_i P^*_{ij}(u_i \ge j \mid u)\left[1 - P^*_{ij}(u_i \ge j \mid u)\right],$$

$$\frac{\partial^2 P^*_{ij}(u_i \ge j \mid u)}{\partial u^2} = a_i \frac{\partial P^*_{ij}(u_i \ge j \mid u)}{\partial u}\left[1 - 2P^*_{ij}(u_i \ge j \mid u)\right] = a_i^2 P^*_{ij}(u_i \ge j \mid u)\left[1 - P^*_{ij}(u_i \ge j \mid u)\right]\left[1 - 2P^*_{ij}(u_i \ge j \mid u)\right]. \quad (4)$$

Accordingly, I_i(u) can be computed using the first and second derivatives of the item category probabilities, P_ij(u_i = j|u), which are as follows:

$$\frac{\partial P_{ij}(u_i = j \mid u)}{\partial u} = \frac{\partial P^*_{ij}(u_i \ge j \mid u)}{\partial u} - \frac{\partial P^*_{i,j+1}(u_i \ge j + 1 \mid u)}{\partial u},$$

$$\frac{\partial^2 P_{ij}(u_i = j \mid u)}{\partial u^2} = \frac{\partial^2 P^*_{ij}(u_i \ge j \mid u)}{\partial u^2} - \frac{\partial^2 P^*_{i,j+1}(u_i \ge j + 1 \mid u)}{\partial u^2}. \quad (5)$$

The TIF for a test of polytomous items is the sum of the respective IIFs:

$$\mathrm{TIF}(u) = \sum_{i=1}^{I} I_i(u). \quad (6)$$

The relationship between true and observed IRT estimates can be written as ^u = u + e. If u and e are independent, the variance of ^u is the sum of the true and error variances, σ²{^u} = σ²{u} + σ²{e}. The conditional variance of ^u given u is defined as σ²{^u|u} = σ²{e|u} = (TIF(u))⁻¹. The expected conditional variance of ^u for a specific distribution of u, φ(u|Ω), is as follows:

$$E\{\sigma^2\{\hat{u} \mid u\}\} = \int_{-\infty}^{\infty} \sigma^2\{\hat{u} \mid u\}\,\varphi(u \mid \Omega)\,du. \quad (7)$$

The expected reliability of ^u is the well-known ratio of true to observed variance,

$$\rho_{\hat{u}\hat{u}} = \frac{\sigma^2}{\sigma^2 + E\{\sigma^2\{\hat{u} \mid u\}\}}, \quad (8)$$

where σ² is the variance of u specified in φ(u|Ω). Clearly, ρ_^u^u is dependent on the characteristics of the test (i.e., σ²{^u|u}) and the distribution of latent scores (i.e., φ(u|Ω)).
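Equations 7 and 8 can be evaluated by numerical quadrature. The sketch below (Python) uses a trapezoid rule and a toy TIF built from ten dichotomous 2PL items; the toy test and all names are stand-ins of mine, not quantities from the article:

```python
import math

def tif(theta):
    """Toy TIF: ten dichotomous items (a = 1.25, locations spread over [-2, 2]);
    each 2PL item contributes a**2 * P * (1 - P)."""
    a, locations = 1.25, [-2.0 + 4.0 * i / 9.0 for i in range(10)]
    total = 0.0
    for b in locations:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def irt_reliability(tif, lo=-6.0, hi=6.0, n=2000):
    """Equation 7 (expected error variance, with sigma^2{u-hat | u} = 1 / TIF(u))
    followed by Equation 8 (reliability), assuming u ~ N(0, 1) so sigma^2 = 1."""
    h = (hi - lo) / n
    grid = [lo + k * h for k in range(n + 1)]
    weights = [h if 0 < k < n else h / 2.0 for k in range(n + 1)]
    expected_error = sum(w * (1.0 / tif(u)) * normal_pdf(u)
                         for w, u in zip(weights, grid))
    return 1.0 / (1.0 + expected_error)
```

Replacing `normal_pdf` with a Fleishman PT density reproduces the population dependence of ρ_^u^u emphasized in the text.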


Reliability and Precision Within CTT Framework

Let x be the total score, or sum, of the I u_i, such that u_i = E{u_i|u} + e_i. For a subject with a given u, u_i equals one of J + 1 categories, each with probability P_ij(u_i = j|u), so E{u_i|u} is

$$E\{u_i \mid u\} = \sum_{j=1}^{J+1} j\,P_{ij}(u_i = j \mid u). \quad (9)$$

That is, E{u_i|u} is a weighted average of the category scores (i.e., j = 1 to J + 1) and the chance that subjects with a specific u have an observed score u_i. The previous section discussed an expression for the expected conditional variance of ^u as a function of the inverse TIF and the distribution of u. This section analogously derives an expression for E{σ²{x|u}}, which is the expected variance of x for a given value of u. An equation for the reliability of x is presented as a function of E{σ²{x|u}} and the variance of the expected true scores, σ²{E{u_i|u}}. This subsection proceeds by first deriving an expression for the conditional error variance of x, E{σ²{x|u}}, and then identifies an equation for the true score variance, σ²{E{u_i|u}}.

Consider item i, where the error is e_i = u_i − E{u_i|u}. Note that u_i is a coarse measure of E{u_i|u} in that u_i is ordinal and equals one of the J + 1 values, whereas u is measured on an interval scale. One immediate observation is that E{e_i|u} = 0. Recall that u_i is a polytomous item, so e_i = j − E{u_i|u} for j = 1 to J + 1. The conditional expectation of the error within item i is

$$E\{e_i \mid u\} = \sum_{j=1}^{J+1} P_{ij}(u_i = j \mid u)\,(j - E\{u_i \mid u\}) = \sum_{j=1}^{J+1} j\,P_{ij}(u_i = j \mid u) - E\{u_i \mid u\} = 0. \quad (10)$$

The variability of the observed u_i around its expected value is determined by the error variance, or precision; that is, σ²{u_i|u} = E{e_i²|u}, which is defined as

$$\sigma^2\{u_i \mid u\} = E\left\{(u_i - E\{u_i \mid u\})^2 \mid u\right\} = \sum_{j=1}^{J+1} P_{ij}(u_i = j \mid u)\,(j - E\{u_i \mid u\})^2. \quad (11)$$

Equation 11 is new to the literature, and the following sections compare the properties of σ²{u_i|u} with σ²{^u|u}.
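Equations 9 and 11 can be sketched as follows (Python, with the same illustrative parameterization as before); for a two-category item coded 1 and 2, the result reduces to the Bernoulli variance P(1 − P):

```python
import math

def p_star(theta, a, b, c):
    # Cumulative category probability (Equation 1, illustrative parameterization)
    return 1.0 / (1.0 + math.exp(-a * (theta - b - c)))

def conditional_item_variance(theta, a, b, thresholds):
    """sigma^2{u_i | u} from Equation 11, with E{u_i | u} from Equation 9."""
    p = [1.0] + [p_star(theta, a, b, c) for c in thresholds] + [0.0]
    probs = [p[j] - p[j + 1] for j in range(len(p) - 1)]       # Equation 2
    mean = sum(j * pj for j, pj in enumerate(probs, start=1))  # Equation 9
    return sum(pj * (j - mean) ** 2 for j, pj in enumerate(probs, start=1))
```

Plotting this function over a grid of u values reproduces the downward-facing shape discussed below, with maxima near the category thresholds.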

σ²{u_i|u} is the conditional variance for a single item, and an expression is needed for the conditional variance of x. An important observation is that errors within two polytomous items, say, u_i and u_h (let categories for u_h be indexed by k), are independent whenever the items are locally independent, which assumes that P(u_i = j, u_h = k|u) = P_ij(u_i = j|u) P_hk(u_h = k|u). Specifically, the covariance between e_h and e_i conditioned on u is

$$\begin{aligned} \sigma\{e_i, e_h \mid u\} &= E\{(u_i - E\{u_i \mid u\})(u_h - E\{u_h \mid u\}) \mid u\} \\ &= \sum_{j=1}^{J+1}\sum_{k=1}^{J+1}(j - E\{u_i \mid u\})(k - E\{u_h \mid u\})\,P(u_i = j, u_h = k \mid u) \\ &= \sum_{j=1}^{J+1}(j - E\{u_i \mid u\})\,P_{ij}(u_i = j \mid u)\sum_{k=1}^{J+1}(k - E\{u_h \mid u\})\,P_{hk}(u_h = k \mid u) = 0. \end{aligned} \quad (12)$$

The finding that errors within items are independent, if local independence is assumed, is particularly important, because the conditional error variance of the total score, x, given u is simply the sum of the conditional item variances, σ²{u_i|u}.

Recall that x = Σ_{i=1}^{I} u_i. Equation 9 implies that the expected value of x given u is

$$E\{x \mid u\} = \sum_{i=1}^{I} E\{u_i \mid u\} = \sum_{i=1}^{I}\sum_{j=1}^{J+1} j\,P_{ij}(u_i = j \mid u). \quad (13)$$

The conditional variance of x given u is the sum of the conditional variances for the I u_i under the assumption of local independence (see Equation 12), which implies that σ²{x|u} = Σ_{i=1}^{I} σ²{u_i|u}. The expected conditional variance for a specific distribution of u is

$$E\{\sigma^2\{x \mid u\}\} = \int_{-\infty}^{\infty} \sigma^2\{x \mid u\}\,\varphi(u \mid \Omega)\,du. \quad (14)$$

Recall that in IRT the expected variance of maximum likelihood estimates is the expected value of σ²{^u|u} across the distribution of u. Similarly, the expected conditional variance of x is found by replacing σ²{^u|u} with σ²{x|u}. Consequently, researchers can compare the CSEMs (i.e., σ{x|u} and σ{^u|u}) to understand which values of u are associated with relatively more precision within the CTT and IRT frameworks.

Recall that E{x|u} is the expected total score for subjects with a given u, and the variance of E{x|u} across subjects provides a measure of the amount of true score variance. First, note that the unconditional mean of x is

$$E\{x\} = E\{E\{x \mid u\}\} = \int_{-\infty}^{\infty} E\{x \mid u\}\,\varphi(u \mid \Omega)\,du. \quad (15)$$

The variance of E{x|u} across u is

$$\sigma^2\{E\{x \mid u\}\} = E\left\{(E\{x \mid u\} - E\{x\})^2\right\} = \int_{-\infty}^{\infty} \left(E\{x \mid u\} - E\{x\}\right)^2\,\varphi(u \mid \Omega)\,du. \quad (16)$$

The reliability of x is the ratio of true to observed variance:

$$\rho_{xx} = \frac{\sigma^2\{E\{x \mid u\}\}}{\sigma^2\{E\{x \mid u\}\} + E\{\sigma^2\{x \mid u\}\}}. \quad (17)$$

Factors That Affect the Precision of CTT and IRT Scores

The previous section derived expressions for the CSEMs of x and ^u (i.e., σ{x|u} and σ{^u|u}). The characteristics of polytomous IRT CSEMs are well understood. For example, σ{^u|u} tends to be smaller at points along the latent continuum where category thresholds are located, and σ{^u|u} declines in regions where more discriminating items are located. In contrast, the expression for σ{x|u} is new, and the purpose of this section is to compare σ{x|u} with σ{^u|u} to provide researchers with a conceptual understanding of the testing situations where x may be preferred to ^u in terms of measurement precision.

As an example, consider a test consisting of three items that each have four response categories. Moreover, let a and b be three-dimensional vectors of item discriminations and locations, such that a = (0.5, 1.5, 2.5) and b = (−1, 0, 1). Moreover, to simplify this example, let the category thresholds be equally spaced at (−1.6, 0, 1.6) units below and above b_i (i.e., the thresholds for the first item are located at −2.6, −1.0, and 0.6 on the u scale).

Figure 1 presents σ²{u_i|u} (see Equation 11) and [I_i(u)]⁻¹ (see Equation 3) for the three items. The IRT conditional variances exhibit expected behavior. That is, [I_i(u)]⁻¹ is lowest at the category thresholds, and [I_3(u)]⁻¹ is generally smaller than for the other two items because the third item is more discriminating. One additional nuance is that [I_3(u)]⁻¹ has local minima at the category thresholds, whereas Items 1 and 2 have inverse information functions that appear smoother. Stated differently, in IRT, items with larger discriminations tend to have IIFs with more topography in regions near category thresholds.

Figure 1. Conditional variances of CTT and IRT scored items for a hypothetical three-item test with four response options and equally spaced thresholds, c_j = (−1.6, 0, 1.6).
Note: CTT = classical test theory; IRT = item response theory.

As expected, σ²{u_i|u} is smaller for more discriminating items; however, the general behavior of σ²{u_i|u} differs from [I_i(u)]⁻¹. Namely, σ²{u_i|u} tends to be a downward-facing function where measurement error is largest at category thresholds. For example, σ²{u_3|u} tends to be the smallest of the three items, but σ²{u_3|u} has maxima near the category thresholds, which differs from IRT, where measurements are more precise at category thresholds. Under a CTT framework, σ²{u_i|u} is smallest either for more extreme u or for u values that lie between category thresholds. In short, σ²{u_i|u} is roughly the mirror image of [I_i(u)]⁻¹, because [I_i(u)]⁻¹ tends to be lower in segments of the u continuum where σ²{u_i|u} is larger and vice versa.

Figure 2 includes the same items discussed in Figure 1, with the exception that the thresholds are no longer equally spaced. Figure 2 shows that σ²{u_i|u} and [I_i(u)]⁻¹ respond inversely to unequal item thresholds. For instance, σ²{u_i|u} tends to increase in portions of the latent continuum that include more item thresholds, whereas [I_i(u)]⁻¹ is smaller in segments of the latent continuum where there are more item thresholds and increases wherever there are fewer thresholds.

Figure 2. Conditional variances of CTT and IRT scored items for a hypothetical three-item test with four response options and unequally spaced thresholds, c_j = (−1.6, 0, 1.0).
Note: CTT = classical test theory; IRT = item response theory.

Figure 3 demonstrates CTT and IRT test CSEMs along with error bars around the conditional expected values, E{x|u} and E{^u|u}. The top row of Figure 3 illustrates the CTT and IRT CSEMs for x and ^u, respectively, for the hypothetical three-item test. Note that the vertical lines in the top row of panels in Figure 3 indicate category thresholds for Items 1, 2, and 3. Figure 3 shows that σ²{x|u} is larger for u values that are near category thresholds, whereas σ²{^u|u} is smaller at points on the latent continuum that have more thresholds and more discriminating items. For the three-item test, σ²{^u|u} appears more responsive to item discriminations than σ²{x|u}, as indicated by the fact that the slope of σ²{^u|u} is steeper than that of σ²{x|u} in the u range measured by Item 1.

The second row of panels in Figure 3 plots E{x|u} and E{^u|u} as well as ±2 times the CSEMs. As discussed previously, E{x|u} is a more accurate indicator of u for more extreme values on the u continuum, whereas ^u is a better indicator in the range where items and categories are located. The overall reliability of x and ^u is dependent on the shape and location of the latent distribution. For example, if u ~ N(0, 1), most subjects lie in the middle range of the latent continuum, where x is less precisely measured relative to ^u. In fact, the differences between σ²{x|u} and σ²{^u|u} contribute to ^u being significantly more reliable than x (i.e., 0.65 vs. 0.49). Certainly, ρ_xx and ρ_^u^u will change depending on the shape and location of the u distribution.

Figure 3. Test score CTT and IRT conditional standard error of measurement and expected value plots with ±2 error curves for a hypothetical three-item test with four response categories.

Note: CTT = classical test theory; IRT = item response theory. Thresholds are indicated by dashed vertical lines for Item 1 (b_1 = −1, a_1 = 0.5), Item 2 (b_2 = 0, a_2 = 1.5), and Item 3 (b_3 = 1, a_3 = 2.5), and c_j = (−1.6, 0, 1.6). ρ_xx and ρ_^u^u were calculated under the assumption that u ~ N(0, 1).

Figure 4. Reliability of CTT and IRT scores of a 10-item test for different item locations, number of response categories, and subject latent distribution shapes.

Note: CTT = classical test theory; IRT = item response theory. a_i = 1.0 and category thresholds were equally spaced along the latent continuum.

The Reliability of x and ^u for Item and Subject Characteristics

Figure 3 included results for a simple example to demonstrate the theoretical differences between σ²{x|u} and σ²{^u|u}, which are useful pieces of information for developing tests and instruments. That is, σ²{x|u} and σ²{^u|u} provide applied researchers with an understanding of the ranges of u values that are best measured with x or ^u. Moreover, σ²{x|u} and σ²{^u|u} offer researchers information about which measurement framework is most beneficial for various item characteristics and subject populations.

Equations 8 and 17 were used to compare the reliability of x and ^u as a function of the number of scale categories, purpose of measurement (i.e., item locations dispersed along the continuum or clustered in a given region), and u distribution shape. More specifically, Figure 4 includes ρ_^u^u and ρ_xx across scale categories (i.e., 2–10 response options) for three types of item locations and four types of distributions for u. Item discriminations and test length were not manipulated and were fixed at 1.25 and 10, respectively. It is well known that increasing either item discrimination or test length increases reliability, and these parameters were held constant to focus on the other parameters.

To simplify the discussion, the three scenarios assume that items have equally spaced category thresholds. Specifically, the item category thresholds (i.e., the J c_j) were equally spaced between −2.0 and 2.0 on the u − b_i continuum. Let c be the vector of category thresholds, with elements defined as c_j = 2(2j/(J + 1) − 1) for j = 1, . . . , J. For example, the threshold is zero for items with two scale categories (i.e., J + 1 = 2), and items with four scale categories have three thresholds at −1, 0, and 1. Whereas the following discussion assumes the thresholds are equally spaced, researchers can input any set of category thresholds into the R code, which is available at the author's website.
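The threshold rule above can be written in one line; the helper name is mine:

```python
def equally_spaced_thresholds(J):
    """c_j = 2 * (2j / (J + 1) - 1) for j = 1, ..., J: evenly spaced in (-2, 2)."""
    return [2.0 * (2.0 * j / (J + 1) - 1.0) for j in range(1, J + 1)]

print(equally_spaced_thresholds(1))  # [0.0]
print(equally_spaced_thresholds(3))  # [-1.0, 0.0, 1.0]
```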

Let b be a vector of item locations, with items indexed i = 1, . . . , I. The three item location scenarios represent situations where researchers would be interested in measuring u values in a narrow range in the lower or upper tail or in measuring u values across the latent continuum. The item locations for the three scenarios approximate the following uniform distributions: b_i ~ U(−2.5, −1.5) (i.e., b_i = 0.5(2i/(I + 1) − 1) − 2), b_i ~ U(−2.0, 2.0) (i.e., b_i = 2(2i/(I + 1) − 1)), and b_i ~ U(1.5, 2.5) (i.e., b_i = 0.5(2i/(I + 1) − 1) + 2).

As noted, four subject distributions were examined to evaluate how the density of the population at various points on the latent continuum affected ρ_^u^u and ρ_xx. Specifically, the distributions were negatively skewed (γ_3 = −1.5, γ_4 = 4.0), normal (γ_3 = 0, γ_4 = 0), symmetric and peaked (γ_3 = 0, γ_4 = 4.0), and positively skewed (γ_3 = 1.5, γ_4 = 4.0).

Figure 4 includes 12 panels corresponding to the three item locations and four subject distribution types. The middle row of panels in Figure 4 demonstrates that CTT and IRT yield similar reliabilities in situations where tests consist of items that are evenly placed along the latent continuum. Given that σ²{x|u} is smaller in areas with fewer category thresholds, one explanation for the slight advantage of IRT is that CTT CSEMs decline outside of the (−2, 2) range of the items, where there are fewer subjects in the skewed and symmetric distributions. More precisely, Figure 3 showed that σ²{^u|u} is significantly larger relative to σ²{x|u} in portions of the latent continuum that have fewer category thresholds. In contrast, σ²{x|u} is relatively smoother across u values and tends to decline in segments where there are fewer category thresholds. Furthermore, in the case where items are located across the latent continuum, ^u is understandably more reliable than x when the latent distribution is more peaked (i.e., positive kurtosis). The middle row of Figure 4 also shows the expected positive relationship between the number of scale categories and reliability for both ^u and x. However, as found in previous research (e.g., see a review of relevant literature in Chafouleas et al., 2009, and Weng, 2004), reliability increases at a decreasing rate with each additional response category and does not increase significantly beyond four or five response options.

In theory, x is expected to be more reliable than ^u when the latent distribution is mismatched with item locations. The top and bottom panels of Figure 4 demonstrate the theoretical results discussed in this article concerning the effect that subject and item characteristics have on CTT and IRT reliability. For instance, CTT is superior to IRT whenever the items are located at the extremes of the latent distribution. Furthermore, the difference between ρ_xx and ρ_^u^u is largest for two-category items and approaches zero as the number of categories increases. As discussed earlier, IRT is a tool for measuring specific u values more precisely. In fact, the difference between ρ_xx and ρ_^u^u is smallest when the majority of the latent distribution overlaps with item locations, which occurs when items are located in the lower portion of the latent continuum and the latent distribution is positively skewed (e.g., b_i ~ U(−2.5, −1.5), γ_3 = 1.5, and γ_4 = 4.0) or when items are in the upper tail and the distribution is negatively skewed (i.e., b_i ~ U(1.5, 2.5), γ_3 = −1.5, and γ_4 = 4.0).

In short, the results in this section provide new information about the influence of item and subject characteristics on the relative reliability of x and θ̂. In fact, the findings reflect the expected results given the nature of σ²{x|θ} and σ²{θ̂|θ}. That is, total scores were more reliable whenever items were mismatched with the location of the latent θ distribution. The following section addresses another important concern related to the accuracy of linear approximations of polytomous items.

Appropriateness of a Linear Approximation of Polytomous Items
One approach for modeling polytomous items is to use a linear approximation of the relationship between θ and u_i (e.g., a common factor model; Culpepper, 2012b). As noted in Equation 13, the relationship between θ and u_i is nonlinear. Ferrando (2009) and Mellenbergh (1996) noted that a linear relationship tends to provide a reasonable approximation when item discrimination indices are relatively low and there are five or more response categories. The purpose of this section is to explore the appropriateness of linear approximations of polytomous items in greater detail. This section includes two subsections. The first subsection describes a measure for quantifying the appropriateness of a linear approximation, whereas the second subsection presents findings about the accuracy of a linear approximation for different item characteristics (e.g., item locations, number of response categories, and item discriminations) and subject characteristics (e.g., the skewness and kurtosis of the latent distribution).

A Measure for the Appropriateness of a Linear Approximation

Let λ_i(θ) be the loading relating latent θ to observed u_i for a given θ. The relationship between θ and u_i for a specific value of θ is the first derivative of Equation 9 with respect to θ:

\lambda_i(\theta) = \frac{\partial E\{u_i \mid \theta\}}{\partial \theta} = \sum_{j=1}^{J+1} j \, \frac{\partial P_{ij}(u_i = j \mid \theta)}{\partial \theta}. \qquad (18)
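The loading in Equation 18 can be computed directly once the category probabilities are specified. The Python sketch below is illustrative only: it assumes a logistic graded response form with boundary probabilities P(u_i ≥ j + 1 | θ) = 1/(1 + exp[−a_i(θ − b_i − c_j)]) as a stand-in for the article's Equation 9 (which is not reproduced in this excerpt), so that E{u_i|θ} = 1 + Σ_j P(u_i ≥ j + 1|θ) and λ_i(θ) has a closed form.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_score(theta, a, b, c):
    """E{u_i | theta} for an item scored 1..J+1; c holds the J ordered
    category thresholds. The logistic boundary form is an assumption."""
    # E{u_i|theta} = 1 + sum_j P(u_i >= j + 1 | theta)
    return 1.0 + sum(sigmoid(a * (theta - b - cj)) for cj in c)

def loading(theta, a, b, c):
    """lambda_i(theta) = dE{u_i|theta}/dtheta (Equation 18), closed form:
    each boundary curve contributes a * s * (1 - s)."""
    return sum(a * sigmoid(a * (theta - b - cj)) * (1.0 - sigmoid(a * (theta - b - cj)))
               for cj in c)
```

The closed-form loading can be checked against a central-difference derivative of E{u_i|θ}, and it plateaus toward zero in the tails of the θ continuum, as the text notes below.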

The relationship between u_i and θ will be stronger for some values of θ and near zero at the extremes of the θ continuum (i.e., portions of the latent continuum where E{u_i|θ} plateaus). A linear approximation is therefore dependent on the characteristics of the examinee population. The expected loading for a given population of subjects can be found by averaging λ_i(θ) over the distribution of θ:


E\{\lambda_i(\theta)\} = E\left\{ \frac{\partial E\{u_i \mid \theta\}}{\partial \theta} \right\} = \sum_{j=1}^{J+1} j \int_{-\infty}^{\infty} \frac{\partial P_{ij}(u_i = j \mid \theta)}{\partial \theta} \, j(\theta \mid \Omega) \, d\theta. \qquad (19)

The function that constrains the relationship between θ and u_i to be linear for a given population of examinees is

L\{u_i \mid \theta\} = E\{u_i\} + E\{\lambda_i(\theta)\}(\theta - \mu), \qquad (20)

where E\{u_i\} = \int E\{u_i \mid \theta\} \, j(\theta \mid \Omega) \, d\theta, μ = E{θ}, and L{u_i|θ} denotes the linear approximation of the relationship between θ and u_i. Also, note that L{u_i|θ} has the same form as the least squares projection: L{u_i|θ} passes through the centroid of θ and u_i, (μ, E{u_i}), and its slope is the expected slope for a given population of examinees, E{λ_i(θ)}.
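The quantities in Equations 19 and 20 reduce to one-dimensional quadrature over the latent density. The following minimal Python sketch again assumes a logistic graded response parameterization (an assumption; the article's Equation 9 is not shown here) and a standard normal j(θ|Ω):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_score(theta, a, b, c):
    # E{u_i|theta}; assumed boundary form sigmoid(a * (theta - b - c_j))
    return 1.0 + sum(sigmoid(a * (theta - b - cj)) for cj in c)

def loading(theta, a, b, c):
    # lambda_i(theta), Equation 18, in closed form for the assumed model
    return sum(a * sigmoid(a * (theta - b - cj)) * (1.0 - sigmoid(a * (theta - b - cj)))
               for cj in c)

def linear_approximation(a, b, c):
    """Return E{u_i}, E{lambda_i(theta)}, and the function L{u_i|theta}
    of Equation 20 for theta ~ N(0, 1)."""
    grid = np.linspace(-8.0, 8.0, 4001)
    # Quadrature weights: standard normal density times grid spacing
    w = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi) * (grid[1] - grid[0])
    e_u = float(np.sum(expected_score(grid, a, b, c) * w))   # E{u_i}
    e_lam = float(np.sum(loading(grid, a, b, c) * w))        # E{lambda_i(theta)}
    mu = 0.0                                                 # mu = E{theta}
    return e_u, e_lam, lambda theta: e_u + e_lam * (theta - mu)
```

For an item centered at the mean with symmetric thresholds (b_i = 0, c = −1.6, 0, 1.6), symmetry places the centroid at the scale midpoint, E{u_i} = 2.5, and L{u_i|θ} passes through (0, 2.5).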

Let d_i = u_i − L{u_i|θ} be the error when predicting the observed item u_i with a linear approximation. Unlike E{e_i|θ}, E{d_i|θ} ≠ 0:

E\{d_i \mid \theta\} = E\{u_i - L\{u_i \mid \theta\} \mid \theta\} = E\{u_i \mid \theta\} - L\{u_i \mid \theta\}. \qquad (21)

That is, d_i will be biased for certain values of θ, but E{E{d_i|θ}} = 0 across the θ range for a given population. The bias in d_i does not translate into larger conditional variances for L{u_i|θ}. In fact, σ²{d_i|θ} = σ²{u_i|θ}:

\sigma^2\{d_i \mid \theta\} = E\left\{ (d_i - E\{d_i \mid \theta\})^2 \mid \theta \right\} = E\left\{ (u_i - E\{u_i \mid \theta\})^2 \mid \theta \right\} = \sigma^2\{u_i \mid \theta\}. \qquad (22)
Furthermore, the bias in d_i does not introduce dependence among residuals. Let d_h = u_h − L{u_h|θ} be the residual when linearly approximating u_h. The covariance between errors is

\sigma\{d_i, d_h \mid \theta\} = E\{ (d_i - E\{d_i \mid \theta\})(d_h - E\{d_h \mid \theta\}) \mid \theta \} = E\{ (u_i - E\{u_i \mid \theta\})(u_h - E\{u_h \mid \theta\}) \mid \theta \} = 0, \qquad (23)

where the last equality follows by recalling that the assumption of local independence implies E{e_i e_h|θ} = 0, as shown in Equation 12.

Ferrando (2009) described a measure of the appropriateness of a linear model, which is based on restricting observed scores to fall within feasible ranges (e.g., 1 and J + 1 for a polytomous item with J + 1 categories). More precisely, 1 < L{u_i|θ} < J + 1 implies that a linear approximation is appropriate for θ scores within the following range:

\frac{1 - E\{u_i\}}{E\{\lambda_i(\theta)\}} + \mu < \theta < \frac{J + 1 - E\{u_i\}}{E\{\lambda_i(\theta)\}} + \mu. \qquad (24)
Following Ferrando, the floor and ceiling indices for the appropriateness of a linear approximation of u_i are d₀ = (1 − E{u_i})/E{λ_i(θ)} + μ and d₁ = (J + 1 − E{u_i})/E{λ_i(θ)} + μ, respectively. Ferrando's measure of appropriateness is the proportion of examinees with θ scores in the viable range:

P(d_0 < \theta < d_1) = \int_{d_0}^{d_1} j(\theta \mid \Omega) \, d\theta. \qquad (25)
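Ferrando's index is straightforward to compute numerically. The Python sketch below strings Equations 18 through 25 together for θ ∼ N(0, 1), again under an assumed logistic graded response boundary form (a stand-in for the article's Equation 9):

```python
import numpy as np
from math import erf, sqrt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_score(theta, a, b, c):
    # E{u_i|theta} under the assumed logistic graded response form
    return 1.0 + sum(sigmoid(a * (theta - b - cj)) for cj in c)

def loading(theta, a, b, c):
    # lambda_i(theta) = dE{u_i|theta}/dtheta (Equation 18)
    return sum(a * sigmoid(a * (theta - b - cj)) * (1.0 - sigmoid(a * (theta - b - cj)))
               for cj in c)

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ferrando_proportion(a, b, c):
    """P(d0 < theta < d1) of Equation 25 for theta ~ N(0, 1)."""
    grid = np.linspace(-8.0, 8.0, 4001)
    w = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi) * (grid[1] - grid[0])
    e_u = float(np.sum(expected_score(grid, a, b, c) * w))   # E{u_i}
    e_lam = float(np.sum(loading(grid, a, b, c) * w))        # E{lambda_i(theta)}
    mu = 0.0
    d0 = (1.0 - e_u) / e_lam + mu               # floor index
    d1 = (len(c) + 1.0 - e_u) / e_lam + mu      # ceiling index (J + 1 categories)
    return normal_cdf(d1) - normal_cdf(d0)
```

With a_i = 2 and thresholds (−1.6, 0, 1.6), this sketch yields approximately 0.989 for a centered item (b_i = 0) and approximately 0.876 for an item located at b_i = 2, in line with the Figure 5 quantities discussed in the text.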

Consider the examples presented in Figure 5, which compares E{u_i|θ} with L{u_i|θ} for three items with b_i = (−2, 0, 2) and the following parameter values held constant: a_i = 2 for all i, four response categories (J + 1 = 4), category thresholds of c = (−1.6, 0, 1.6), and Ω = (0, 1, 0, 0). The middle panel in Figure 5 includes the case where b_i = 0. Figure 5 shows that E{u_i|θ} and L{u_i|θ} deviate at the ends of the latent continuum. However, given that the distribution of θ is standard normal, nearly all of the subjects (i.e., 98.9%) have approximated scores between d₀ and d₁. The top and bottom rows of Figure 5 include examples of items that are located at −2 and 2, respectively. In fact, when items are located at the extremes of the θ continuum, P(d₀ < θ < d₁) = 0.876, which implies that 12.4% of subjects have approximated scores outside the viable bounds. The examples in Figure 5 demonstrate that item location and the shape of the latent distribution affect the appropriateness of linear approximations. For instance, Figure 5 provides insight that P(d₀ < θ < d₁) is larger in value if items are located near the mean of a standard normal distribution.

Impact of Item and Subject Characteristics on the Appropriateness of a Linear Approximation

The previous subsection discussed P(d₀ < θ < d₁) as a measure of the appropriateness of a linear model. The purpose of this subsection is to present more general findings about the role of item and subject characteristics on the accuracy of a linear approximation. Specifically, this subsection discusses the appropriateness of a linear approximation of u_i for 12 scenarios as defined by three response category scenarios (J + 1 = 2-10) and four θ distribution scenarios (e.g., the same distributions examined in Figure 4), with the addition that item discrimination indices are compared for a_i = 0.5, 1.0, 1.5, and 2 and category thresholds are defined as discussed for the scenarios in Figure 4.

Figure 5. Hypothetical E{u_i|θ} and L{u_i|θ} for items with different locations and four response options.
Note: The minimum and maximum total scores are denoted on the y-axis by 1 and 4. Also, θ ∼ N(0, 1), a_i = 2.0, and c_j = (−1.6, 0, 1.6).

Figures 6 and 7 plot P(d₀ < θ < d₁) against item locations (i.e., b_i) and provide evidence about circumstances where a linear approximation is appropriate. Note that the rows of Figures 6 and 7 correspond to the number of response categories (i.e., J + 1), whereas columns relate to distribution shape. Comparing rows demonstrates that P(d₀ < θ < d₁) increases as the number of response options increases; however, the accuracy of a linear approximation does not improve significantly beyond four response categories. For instance, a linear approximation is appropriate and P(d₀ < θ < d₁) > 0.90 for items with five or more response categories that are located within one standard deviation of the mean. Furthermore, larger item discriminations have a negative effect on P(d₀ < θ < d₁), and the number of response categories and item discrimination have an interactive effect whereby P(d₀ < θ < d₁) declines more as a_i increases when there are fewer response categories. Moreover, Figure 6 illustrates that a linear approximation is appropriate for as few as two response options when a_i = 0.5 or whenever a_i = 1.0 and items are located in the middle of the distribution.

Comparing columns provides an indication of the effect of latent distribution shape and item location on P(d₀ < θ < d₁). Specifically, the relationship between b_i and P(d₀ < θ < d₁) reflects the shape of the latent distribution, and P(d₀ < θ < d₁) is smallest when b_i is located in the tails of the latent distribution. For instance, the P(d₀ < θ < d₁) curves appear more bell shaped when the distribution is normal as opposed to peaked (i.e., γ₃ = 0, γ₄ = 4). In fact, P(d₀ < θ < d₁) is smaller when θ ∼ N(0, 1), ceteris paribus. The relationship between item location and P(d₀ < θ < d₁) is cubic when the latent distribution is skewed, and P(d₀ < θ < d₁) is smaller in skewed distributions in segments where there is less density in the latent distribution (e.g., the right tail if γ₃ = 1.5 and the left tail if γ₃ = −1.5).

Figure 6 also demonstrates the effect of item discrimination on P(d₀ < θ < d₁). Ferrando (2009) noted that a linear approximation is most appropriate when a_i is smaller. In fact, Ferrando's recommendations are supported given that a linear approximation is expected to be appropriate for almost all subjects and item locations when a_i = 0.50. Figures 6 and 7 also show conditions where linear approximations are appropriate even when items are more discriminating (e.g., a_i = 2). For example, P(d₀ < θ < d₁) is larger when J + 1 ≥ 3 and items are located near the middle of the latent distribution. P(d₀ < θ < d₁) declines as a_i increases for all of the scenarios included in Figure 6, albeit at different rates. In short, the results in Figures 6 and 7 imply that linear approximations are best for items that measure θ near the central portion of the latent distribution. Furthermore, L{u_i|θ} is least accurate for highly discriminating items, and a linear approximation is inaccurate, even with more than four response categories, when items are located in the tails of the latent distribution.

Discussion

The findings in this study offer guidance to applied researchers interested in the construction and analysis of polytomous items. In short, this article presented new theoretical results concerning item and subject characteristics that affect the relative precision and reliability of x and θ̂. This section summarizes the findings for psychometricians and applied researchers and offers concluding remarks.

Figure 6. The proportion of viable scores from a linear approximation of polytomous items with two to four response categories by item location and discrimination, and subject latent distribution shapes.
Note: Category thresholds were equally spaced along the latent continuum.

Figure 7. The proportion of viable scores from a linear approximation of polytomous items with five to seven response categories by item location and discrimination, and subject latent distribution shapes.
Note: Category thresholds were equally spaced along the latent continuum.

This article studied the reliability of two alternative test scoring approaches: a CTT total score versus an IRT θ̂ estimate. The majority of applied researchers in education and psychology are familiar with x, whereas fewer have knowledge of θ̂ and IRT. Consequently, it is important to offer applied researchers information about the relative merits of x and θ̂. This article offered additional insights into fundamental differences between x and θ̂. One salient factor studied in this article was the interactive effect that item locations and θ distribution shape had on ρ_xx and ρ_θ̂θ̂. Suppose the purpose of testing is to accept high-scoring examinees into an institution, as is the case with examinations for the actuarial sciences, where the latent distribution is either normal or positively skewed. For instance, the first actuarial exam includes items clustered in the upper portion of the θ continuum to identify those students who are most competent in the foundations of calculus and probability theory. The scores for examinees who are near the passing cutoff are measured more precisely than those who are in the middle or lower portions of the θ distribution, as indicated by smaller values of σ²{θ̂|θ}. In contrast, the results in this article show that σ²{x|θ} is larger near the cutoff score and relatively smaller for θ values in the lower and middle portions of the examinee distribution. Certainly, decision makers would prefer IRT scoring to a total score in this instance, because the purpose of measurement is to evaluate whether test takers exceed a minimum proficiency level. However, secondary users of the actuarial test score data should prefer total scores because, as indicated in Figure 4, x tends to be more reliable for a population of subjects than θ̂ if items are difficult and located in the tails of the test-taker population distribution. For instance, some secondary users may want to gather validity evidence and correlate examinee scores with other indicators, such as undergraduate or graduate grade point average (Culpepper, 2010; Culpepper & Davenport, 2009), other aptitude tests, or job performance (Aguinis, Culpepper, & Pierce, 2010). In these instances, the validity coefficient associated with θ̂ would be smaller than the coefficient for x because ρ_xx > ρ_θ̂θ̂.

In addition, the results provide evidence that CTT and IRT scoring methods perform simi-
larly when the goal of testing is to measure a construct across values of a latent continuum. For
example, federal and state testing programs measure what students know in relation to given
standards. The results in Figure 4 imply that ρ_xx and ρ_θ̂θ̂ are similar and that testing programs could
report total scores, which may be easier to explain to certain stakeholders (e.g., teachers, par-
ents, and students).

The results in this article also offer recommendations for the construction of scales that include polytomous items, such as educational or employment performance assessments, behavioral ratings, or affective measurements. In each case, researchers may prefer x to θ̂ if the purpose of instrument development is to conduct correlational research rather than to measure specific trait levels. However, x should probably be preferred to θ̂ only when researchers fit simple models rather than more complicated interactive models (Embretson, 1996; Kang & Waller, 2005; Morse et al., 2012).

In addition, the findings in this article provide information about the optimal number of item response options. The results in Figure 4 examined the reliability of total scores and IRT estimates across a range of parameter values for the number of scale categories, item locations, and the shape of the latent distribution. The findings in Figure 4 imply that using more than five or six scale categories does not significantly improve the reliability of x or θ̂, regardless of the shape of the latent distribution or the location of items. However, it is important to note that adding a scale category had a larger effect on ρ_θ̂θ̂ than on ρ_xx.

Another relevant finding for applied researchers (and methodologists) relates to the appropriateness of a linear approximation of polytomous items. The results in this article confirm arguments in previous research (Ferrando, 2009) that linear approximations are more accurate for less discriminating items and items with more response categories. Additional findings suggest that linear approximations of polytomous items are appropriate for items that measure trait levels in denser segments of the latent distribution, and linear approximations were least appropriate for items located in the tails of the θ distribution. In contrast to previous research, the results provided new evidence that researchers need to consider item locations, in addition to item discriminations and the number of response categories, when employing linear approximations.

Last, another contribution of this study is the availability of the associated R code. More spe- cifically, researchers can use the R code when designing instruments in an effort to evaluate conditions where total scores and IRT scores are more or less reliable and precise. Furthermore, the R code has pedagogical value, as well, for computational applications of the theoretical results.

There are several directions for future research to build on this study. First, this article offers recommendations for researchers who are interested in using x as a measure of θ in applied research. Specifically, designing a reliable x requires the inclusion of items located at the boundaries of the θ range of interest. For instance, if the measurement goal is to distinguish high versus low scorers on some trait, the results pertaining to CTT CSEMs dictate that the items should be located in the middle of the distribution, because low and high scorers will then be measured more precisely. Likewise, items should be located around a cut score (and not at the cut score) if the purpose of measurement is to make inferences about whether examinees exceed or fall below some minimum proficiency level. The behavior of CSEMs under CTT is counterintuitive because, in contrast, item design under the IRT framework dictates that developers should write items that are specific to certain θ levels and measurement purposes. Additional research is needed to understand differences in optimal test assembly (H. Chang & Ying, 2009) within the IRT and CTT frameworks.

Second, there could be benefits in revisiting some topics in modern IRT within the context of CTT. For instance, there may be new insights available by reexamining topics in CTT, such as computer adaptive testing (H. Chang & Ying, 1996) or equating techniques (Kolen & Brennan, 2004), which could lead to new methodologies, refinements of existing approaches, or other unanticipated discoveries.

Third, researchers could extend the results in this article to understand the impact of using x and θ̂. Specifically, researchers use total scores as dependent variables and predictors in every subdiscipline of psychology and education. Despite the widespread use of total scores, few methodological studies have examined the impact of using total scores on the power and Type I error rates of the tests that researchers employ (Embretson, 1996; Kang & Waller, 2005; Morse et al., 2012). Furthermore, with the exception of Embretson (1996), previous studies utilized Monte Carlo techniques that are limited by the parameter values studied. Consequently, future analytic explorations could provide additional insights into the effect of using total scores, and future research should accordingly examine the effect that the number of scale categories, the shape of the latent distribution, and IRT parameters have on the performance of commonly used statistical tests (Culpepper, 2012a; Culpepper & Aguinis, 2011).

Fourth, this study examined the theoretical reliability of x and θ̂ using a polytomous IRT model. As one anonymous reviewer noted, this article addressed reliability from a mathematical perspective and did not consider factors related to subjects' cognitive decision making. For example, this article did not address issues related to category labels; however, previous research identified a causal effect of scale labels, category position, and rating scale intensity and length on certain observed item characteristics (Dunham & Davison, 1991; Lam & Stevens, 1994; Murphy & Constans, 1987). For instance, existing evidence suggests that scale labels can affect observed item means, but there is less evidence that manipulating category labels alters observed item variances (L. Chang, 1997; Dunham & Davison, 1991). Moreover, positively packed scales tend to affect item means (Dunham & Davison, 1991; Lam & Kolic, 2008), and semantic compatibility of category labels improves reliability (Lam & Kolic, 2008). Most of the previous literature on category labels used CTT or generalizability theory, and additional empirical research is needed to understand how category labeling decisions affect polytomous IRT item parameters (e.g., item locations, discriminations, and category thresholds) and consequently alter the reliability of x and θ̂. Future research may identify relationships between category labels, cognitive decision making, and IRT parameters, and the results in this article provide mathematical arguments for describing how subsequent changes in IRT parameters affect test score reliability.

In conclusion, this study presented new results concerning the relative reliability and precision of total scores and IRT scores. The derivations in this article offer the most extensive analysis of the reliability of total scores by linking parameters of polytomous IRT models with CTT. In addition, new equations were discussed that describe the CTT CSEM and provide new conceptual understanding of differences in the precision of scores estimated within the CTT and IRT frameworks.

Appendix

Fleishman's Power Transformation (PT) Method Probability Density Function (PDF)

Fleishman (1978) developed a PT method for generating nonnormal univariate random variables. The PT method uses the following function to transform a standard normal random variable, y, into a metric with a given γ₃ and γ₄:

\theta = f(y \mid v) = \sum_{r=1}^{4} v_r y^{r-1}, \qquad (A1)

where the vector of Fleishman coefficients, v = (v₁, v₂, v₃, v₄), is identified so that θ has a predetermined mean, variance, skewness, and kurtosis. Additional research extended Fleishman's method to higher order PTs (Headrick, 2002) and multivariate circumstances (Headrick & Sawilowsky, 1999; Lyhagen, 2008; Vale & Maurelli, 1983).

One critique of Fleishman’s method was that there was no known probability distribution for variables generated using Fleishman’s technique (Tadikamalla, 1980). Recent research derived the PDF for Fleishman’s PT method (Headrick & Kowalchuk, 2007) and showed that the PDF for Fleishman’s PT method is

j ujΩð Þ = f fð ^{1}ðujvÞÞ

sf^{0}ðf^{1}ðujvÞjvÞ, ðA2Þ
where f(u) is the standard normal distribution, f^{1}(ujv) is the inverse of Equation A1 that spe-
cifies values of y as a function of u, and f^{0}(yjv) is the derivative of Equation A1 with respect to
y. An expression for f^{1}(ujv) can be found using Cardano’s formula for the inverse of a cubic
polynomial. The equation for the only real root of f^{1}(ujv) is defined as

y = f^{1}ðujvÞ = q +

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ
q^{2}+ rð p^{2}Þ^{3}

q 1=3

q

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ
q^{2}+ rð p^{2}Þ^{3}
q

1=3

+ p, ðA3Þ

where the second cube root is the absolute value of q

ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ
q^{2}+ (r p^{2})^{3}
q

. Furthermore, p, q, and r are a function of v and u as shown in the following:


p = -\frac{v_3}{3 v_4}, \qquad q = p^3 + \frac{v_2 v_3 - 3 v_4 \left(v_1 - \frac{\theta - \mu}{\sigma}\right)}{6 v_4^2}, \qquad r = \frac{v_2}{3 v_4}. \qquad (A4)
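Equations A3 and A4 translate directly into code. The Python sketch below (illustrative only; μ = 0 and σ = 1 are assumed defaults) implements the forward transform of Equation A1 and its Cardano inversion, which can be verified by a round trip:

```python
import numpy as np

def fleishman(y, v):
    """Forward power transform, Equation A1: theta = sum_r v_r * y**(r-1)."""
    v1, v2, v3, v4 = v
    return v1 + v2 * y + v3 * y**2 + v4 * y**3

def fleishman_inverse(theta, v, mu=0.0, sigma=1.0):
    """Only real root of the cubic (Equations A3-A4, Cardano's formula)."""
    v1, v2, v3, v4 = v
    p = -v3 / (3.0 * v4)
    q = p**3 + (v2 * v3 - 3.0 * v4 * (v1 - (theta - mu) / sigma)) / (6.0 * v4**2)
    r = v2 / (3.0 * v4)
    disc = np.sqrt(q**2 + (r - p**2)**3)   # real whenever f is monotone
    # np.cbrt returns the real cube root, handling the negative second argument
    return np.cbrt(q + disc) + np.cbrt(q - disc) + p
```

A round trip fleishman_inverse(fleishman(y, v), v) should recover y for any coefficient vector that keeps f monotonically increasing.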

This article uses j(θ|Ω) to understand the impact of nonnormal latent distributions on the reliability of x and θ̂. Specifically, v is first identified for a given γ₃ and γ₄, and then the p, q, and r in Equation A4 are computed to evaluate f⁻¹(θ|v) in Equation A3 for use in Equation A2.

It is important to note that Fleishman's PT method yields a valid PDF only when θ is a monotonically increasing function of y (i.e., f'(y|v) must be positive for all values of y). Headrick and Kowalchuk (2007) proved that the PT method produces a valid PDF if the Fleishman coefficients satisfy the following constraints:

v_4 > \frac{1}{5}\sqrt{\frac{5 + 7 v_2^2}{3}} - \frac{2}{5} v_2, \qquad 0 < v_2 < 1. \qquad (A5)

Consequently, Fleishman’s PT method can only be used to study the impact of nonnormal latent
distributions whenjk_{3}j\4:5.
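As a small sketch, the constraint in Equation A5 can be wrapped in a validity check. Note that the inequality below mirrors the reconstruction given above, which was recovered from a garbled rendering; the exact form should be verified against Headrick and Kowalchuk (2007) before serious use.

```python
import math

def valid_fleishman_pdf(v2, v4):
    """Check the Equation A5 conditions for the PT method to yield a valid PDF.

    The inequality mirrors the reconstruction in the text and should be
    verified against Headrick and Kowalchuk (2007)."""
    if not (0.0 < v2 < 1.0):
        return False
    return v4 > math.sqrt((5.0 + 7.0 * v2**2) / 3.0) / 5.0 - 2.0 * v2 / 5.0
```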

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

References

Adelson, J. L., & McCoach, D. B. (2010). Measuring the mathematical attitudes of elementary students: The effects of a 4-point or 5-point Likert-type scale. Educational and Psychological Measurement, 70, 796-807.

Aguinis, H., Culpepper, S. A., & Pierce, C. A. (2010). Revival of test bias research in preemployment testing. Journal of Applied Psychology, 95, 648-680.

Aguinis, H., Pierce, C. A., & Culpepper, S. A. (2009). Scale coarseness as a methodological artifact: Correcting correlation coefficients attenuated from using coarse scales. Organizational Research Methods, 12, 623-652.

Andrich, D. (1978a). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.

Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.

Bandalos, D. L., & Enders, C. K. (1996). The effects of nonnormality and number of response categories on reliability. Applied Measurement in Education, 9, 151-160.

Bechger, T. M., Maris, G., Verstralen, H. H., & Beguin, A. A. (2003). Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27, 319-334.