Latent Trait Theory - Applying Latent Trait Theory to the Development and Validation of a Vocabulary Test

Latent Trait Theory (LTT), also known as Item Response Theory, is another widely used
approach in language testing. LTT assumes that two variables determine an individual’s
performance on a test: the ability of the test-taker and the characteristics of the items.
Underlying LTT is the belief that the performance of subjects on a test can be explained
or predicted by estimating their abilities, or traits, as embodied in the test score. These
traits are often referred to as latent traits because they are abstract and unobservable
and cannot be measured directly (Hambleton & Swaminathan, 1984). As opposed to CTT, LTT
is capable of producing indices that are comparable across different tests and groups of
subjects. Hambleton and Swaminathan (1984) pinpointed three key advantages of LTT:

(1) Item parameter estimates are independent of the group of examinees used.

(2) Test taker ability estimates are independent of the particular set of test items used.

(3) The precision of ability estimates is known.

The first two advantages are known as invariance, the major feature distinguishing LTT
from CTT. Invariance indicates that the ability distribution of test-takers does not affect
the estimates of item characteristics and that the particular items used do not affect the
estimates of test-takers’ ability. Even if a test is administered to two groups of
test-takers with different ability levels, the resulting item characteristics will still be
identical.

Compared to CTT, LTT rests on stronger assumptions. One key assumption,
unidimensionality, holds that a single trait suffices to explain the test performance of a
subject. That is, there is only one dominant trait that can account for the score pattern
and ranking of the test-takers. If a test requires more than one dominant trait to explain
performance, a multidimensional model is called for, and such models lack sufficient
empirical support (Davies et al., 1999; Hambleton & Swaminathan, 1984).

Another central assumption is local independence, which assumes that “when abilities
influencing test performance are held constant, examinees’ responses to any pair of items
are statistically independent” (Hambleton et al., 1991). If local independence is achieved,
a test-taker’s responses to different items will exhibit no relationship once ability is
taken into account. This condition can be achieved when traits other than the ability to be
measured are ruled out.
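In symbols, the quoted definition can be written as below. The notation is an assumption introduced here for illustration and does not come from the source: U_i and U_j stand for the scored responses to items i and j, and θ for the ability being measured.

```latex
P(U_i = u_i,\; U_j = u_j \mid \theta)
  \;=\; P(U_i = u_i \mid \theta)\, P(U_j = u_j \mid \theta)
```

For instance, if a test-taker of a given ability has a 0.8 chance of answering item i correctly and a 0.5 chance on item j, local independence implies a 0.8 × 0.5 = 0.4 chance of answering both correctly.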

Each test item has an item characteristic curve (ICC), which shows the relationship
between the latent trait and the probability of answering the item correctly. There can be
an infinite number of LTT models with different combinations of parameters. Among the most
commonly used models are the one-parameter logistic model, the two-parameter logistic
model, and the three-parameter logistic model.

The one-parameter logistic model, also known as the Rasch model, takes into account a
single parameter: item difficulty, denoted b_i. The b_i parameter is the point on the
ability scale at which the chance of answering correctly is 0.5. As Figure 3 indicates, the
nearer the curve is to the right end of the ability scale, the more difficult the item is.
To have a 50% chance of answering item 2 correctly, a test-taker needs an ability level of
2, while an ability level of -1 is enough for a 50% chance on item 3. The only difference
between curves in the one-parameter logistic model is their location on the ability scale.
This model requires at least 100 participants and about 25 items to produce reliable
results.
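The one-parameter curve can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard logistic form P(θ) = 1 / (1 + exp(-(θ - b))). The function name `rasch_probability` is hypothetical, and the difficulty values 2 and -1 are the ones the text gives for items 2 and 3.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the one-parameter (Rasch) model.

    theta: test-taker ability; b: item difficulty (both on the same scale).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Difficulties taken from the text: item 2 at b = 2, item 3 at b = -1.
item_difficulty = {"item 2": 2.0, "item 3": -1.0}

for name, b in item_difficulty.items():
    # When ability equals difficulty, the probability is exactly 0.5.
    at_difficulty = rasch_probability(theta=b, b=b)
    # A test-taker at theta = 0 has a lower chance on the harder item.
    at_zero = rasch_probability(theta=0.0, b=b)
    print(f"{name}: P(theta=b)={at_difficulty:.3f}  P(theta=0)={at_zero:.3f}")
```

Because difficulty is the only item parameter, changing `b` shifts the curve left or right without altering its shape.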

Figure 3. An Example of One-Parameter Logistic Model. Adopted from Hambleton, 1991

The two-parameter logistic model takes item discrimination into account along with item
difficulty. The item discrimination parameter, denoted a_i, is shown by how steep the slope
of the curve is. An item possessing high discrimination has a steeper curve than items with
lower discrimination. The acceptable range of a_i remains within (0, 2). The higher the
value is, the more discriminating the item is. As indicated by Figure 4, ICCs in the
two-parameter logistic model differ in terms of their location as well as their slopes.
While item 1 and item 2 are the most difficult ones, item 4 possesses better discriminating
power since it has the steepest slope among the four curves. Alderson et al. (1995) proposed
that it takes a minimum of 200 subjects to get the two-parameter model running, while
McNamara (1996) suggested that 500 subjects and 20 items are required.

Figure 4. An Example of Two-Parameter Logistic Model. Adopted from Hambleton, 1991

Figure 5. An Example of Three-Parameter Logistic Model. Adopted from Hambleton, 1991

The three-parameter logistic model includes an additional parameter, the
pseudo-chance-level parameter. It allows observation of the lower end of the ability scale,
providing valuable information on tests that use selected-response items such as
multiple-choice questions. Figure 5 gives an example of the three-parameter model: the lower
asymptote of each curve is no longer zero. Item 3, with its guessing value at 0.25, is more
susceptible to guessing than the other items, which have lower guessing values. Since the
3PL model involves three parameters, it requires a minimum of 1000 participants and 60
items.
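The effect of the discrimination and pseudo-chance-level parameters can be sketched with one function. This is a minimal illustration rather than the method used in this study. It assumes the common logistic form P(θ) = c + (1 - c) / (1 + exp(-a(θ - b))). The function name `icc_probability`, the symbol `c` for the pseudo-chance level, and all parameter values except the guessing level of 0.25 are hypothetical.

```python
import math


def icc_probability(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response under a logistic ICC.

    a: discrimination, b: difficulty, c: pseudo-chance (guessing) level.
    With c = 0 this is the two-parameter model; with c > 0, the three-parameter model.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Two-parameter case: hypothetical items with equal difficulty, different discrimination.
for theta in (-1.0, 0.0, 1.0):
    flat = icc_probability(theta, a=0.5, b=0.0)
    steep = icc_probability(theta, a=1.5, b=0.0)
    print(f"theta={theta:+.1f}  a=0.5: {flat:.3f}  a=1.5: {steep:.3f}")

# Three-parameter case: a guessing level of 0.25, the value the text gives for item 3.
# Even a very weak test-taker keeps roughly a 25% chance of a correct answer.
very_low_ability = icc_probability(theta=-6.0, a=1.0, b=0.0, c=0.25)
print(f"theta=-6.0 with c=0.25: {very_low_ability:.3f}")
```

The item with the larger `a` separates test-takers just below and just above its difficulty more sharply, which is what a steeper curve means. With c > 0, the probability at θ = b is (1 + c) / 2 rather than 0.5. Some presentations also multiply `a` by a scaling constant of 1.7, which is omitted here.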

Aside from providing information on test items and latent traits, Rasch analysis can
single out problematic test items and subjects. This is done through estimation of goodness
of fit, namely, fit analysis. When fit analysis is conducted, the latent trait of each
examinee is estimated, and predictions of the examinee’s performance are then made on the
basis of that estimate. If the predictions match the observed behavior, goodness of fit is
attained. If there are discrepancies between the two, misfitting items or persons are
identified.
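The comparison of predicted and observed responses can be sketched as follows. The source does not say which fit statistic it uses. As an assumption, this example computes the outfit mean-square, one common residual-based statistic in Rasch analysis, for a single item. The abilities, difficulty, and response patterns are hypothetical, and abilities are treated as already known, whereas in practice they are estimated from the data.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Predicted probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def outfit_mean_square(responses, abilities, difficulty):
    """Mean of squared standardized residuals for one item.

    responses: 0/1 scores, one per test-taker; abilities: matching theta values.
    The expected value is 1 when the responses behave as the model predicts.
    """
    total = 0.0
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        # Squared residual (observed minus predicted), scaled by its variance.
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)


# Hypothetical data: five test-takers and one item of difficulty 0.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
expected_pattern = [0, 0, 1, 1, 1]  # weaker test-takers fail, stronger ones pass
reversed_pattern = [1, 1, 0, 0, 0]  # the opposite of what the model predicts

fit_expected = outfit_mean_square(expected_pattern, abilities, difficulty=0.0)
fit_reversed = outfit_mean_square(reversed_pattern, abilities, difficulty=0.0)
print(f"expected pattern: {fit_expected:.2f}")
print(f"reversed pattern: {fit_reversed:.2f}")
```

The reversed pattern yields a much larger value because able test-takers failing an easy item, and weak test-takers passing it, produce large residuals. The same calculation run across one person's responses to many items gives a person-fit statistic.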

Misfit items suggest two possible situations. On the one hand, they might be poorly
written items with low discriminating power, which need further revision. On the other
hand, those items might not be assessing the one latent trait the whole test claims to
measure. Such information enables test developers to identify, revise, or exclude poor items
and thereby refine a test. Likewise, poor person-fit statistics imply that the test has not
accurately estimated the ability of the misfitting person. This might happen when a person
guesses randomly throughout the test or when a test proves a poor measure for a large
population of examinees (Davies et al., 1999; McNamara & Candlin, 1996).

From the document: Applying Latent Trait Theory to the Development and Validation of a Vocabulary Test (pp. 35-41)