Important Criteria for a Good Vocabulary Size Test

Heaton (1990) proposed that the two basic principles for a test are validity and

reliability. According to Bachman and Palmer (1996), the most important quality of a

test is its usefulness, which is determined largely by how test takers perform on the

test. How test takers respond to test items and the outcome of those responses will be

the main concern when it comes to the evaluation of test usefulness. Among the

qualities determining the usefulness of a test, reliability and construct validity are of

primary concern (Bachman, 2004).

Hughes (2003) proposed the basic principles a test or test system should be

evaluated against. According to Hughes, such a test:

consistently provides accurate measures of precisely the abilities in which we are interested;

has a beneficial effect on teaching;

is economical in terms of time and money.

Still, the priority is to make sure that the abilities being measured are what we

intend to measure. In the following section, the concepts of reliability and validity

will be reviewed.

Reliability

Reliability represents the consistency of measurement (Bachman & Palmer,

1996). Theoretically speaking, if a subject takes a test twice, he or she should get identical

scores. Also, if a group of test-takers take a test twice and are ranked according to

their performances, their ranking for the two tests should be exactly the same.

However, in reality, such an ideal is not easy to achieve, since many factors other than

the ability being measured can stand in the way of obtaining a reliable test result. The

factors interfering with test scores include test method facets, attributes of the

test-taker that are not considered part of the target language ability, and unpredictable,

temporary factors (Bachman, 1990). For example, if the items in a vocabulary test are

too long, the test-takers might get items wrong simply because they are too fatigued to

pay full attention to the content of the test. Emotional or physical conditions might

also prevent the test-takers from performing as well as they should. Temporary factors such

as the administrative process, and external conditions like noise or temperature, might

also tamper with the test results. These factors lead to inconsistency of test results, or

measurement error. Therefore, it is of paramount importance to minimize the effect

such factors have on the representativeness of the test score so that a reliable test

result can be obtained. The lower the degree of fluctuation the test results show, the

more reliable a test is. In other words, if measurement error can be minimized,

reliability can be ensured (Hughes, 2003).
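
The link between measurement error and reliability is conventionally written as follows in classical test theory. The symbols are introduced here for illustration and do not appear in the sources cited above: X is the observed score, T the true score, and E the measurement error, with T and E assumed to be uncorrelated.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Under these assumptions, the reliability coefficient approaches 1 as the error variance shrinks relative to the observed-score variance.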

Reliability can be empirically tested by means of Classical Test Theory. One

approach is called the test-retest method. Subjects are asked to take the same test

twice, and the two sets of scores are correlated to yield a reliability coefficient.

However, this method is difficult to put into practice. If a subject takes the two tests at too

short an interval, he or she might be able to recall the answers. If the two tests are

administered at too long an interval, a practice effect may get in the way (Davies et al.,

1999; Schmitt, 2010).
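
The test-retest comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than a procedure taken from the cited sources: it assumes the reliability coefficient is a Pearson correlation, and the function name and the scores are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores of the same five test-takers on two administrations
first_sitting = [12, 18, 25, 31, 40]
second_sitting = [14, 17, 27, 30, 41]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"test-retest reliability estimate: {test_retest_r:.3f}")
```

A coefficient close to 1 indicates that the test-takers kept nearly the same relative standing across the two sittings.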

The other method, the split-half method, is much more popular among

researchers since it requires only one administration. Instead of administering

the same test twice, a test is split into two parallel halves to be analyzed for a coefficient

of internal consistency. Yet the prerequisite that the two halves of the test be

proven equivalent is difficult to satisfy (Hughes, 2003; N. Schmitt, Schmitt,

& Clapham, 2001). In fact, the theoretical assumption of reliability as “the correlation

between test scores on two parallel tests” is difficult to operationalize (Hambleton,

Swaminathan, & Rogers, 1991).
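
A split-half estimate can be sketched as follows. The sketch rests on two assumptions that the text does not specify: the halves are formed by an odd-even split of the items, and the half-test correlation is stepped up to full test length with the Spearman-Brown formula. The function names and the response data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("half-test scores must vary")
    return cov / sqrt(ss_x * ss_y)

def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per test-taker."""
    # Assumed odd-even split: 1st, 3rd, 5th ... items vs. 2nd, 4th, 6th ...
    first_half = [sum(person[0::2]) for person in responses]
    second_half = [sum(person[1::2]) for person in responses]
    r_half = pearson_r(first_half, second_half)
    if r_half <= -1.0:
        raise ValueError("Spearman-Brown correction is undefined for r = -1")
    # Spearman-Brown correction from half-length to full-length test
    return 2 * r_half / (1 + r_half)

# Hypothetical responses of five test-takers to a six-item test
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]

reliability = split_half_reliability(responses)
print(f"split-half reliability estimate: {reliability:.3f}")
```

The estimate depends on how the items are divided, which is one practical face of the equivalence problem noted above.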

In view of the limitations of Classical Test Theory in accurately gauging reliability,

Item Response Theory, capable of estimating the standard error of measurement

across ability levels and test-takers, has been adopted to obtain a more accurate estimate

of reliability (Schmitt, 2010). The theoretical basis of Latent Trait Theory (LTT), as Item

Response Theory is also known, and how it achieves a more accurate estimate of reliability

will be discussed in detail in the section on LTT.
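
How the standard error of measurement can vary across ability levels may be illustrated with a small sketch. It assumes the one-parameter (Rasch) model, which the text above does not specify, and uses hypothetical item difficulties; the function names are likewise illustrative only.

```python
from math import exp, sqrt

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the assumed Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))

def standard_error(theta, difficulties):
    """Standard error of the ability estimate at ability level theta."""
    # Each item contributes information p * (1 - p) at this ability level
    information = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        information += p * (1.0 - p)
    return 1.0 / sqrt(information)

# Hypothetical item difficulties, in logits
item_difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

se_by_ability = {
    theta: standard_error(theta, item_difficulties)
    for theta in (-3.0, 0.0, 3.0)
}
for theta, se in se_by_ability.items():
    print(f"ability {theta:+.1f}: standard error {se:.2f}")
```

In this sketch the standard error is smallest where the item difficulties are centered and grows toward the extremes, so precision is reported per ability level rather than as a single coefficient for the whole test.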

Validity

If a test exhibits validity, it measures the ability or skill the test is intended to

measure. Every test has its own purpose or use, be it for placement or for

assessment. If the test result does not reflect the ability the test claims to measure, the

decisions made based on the test scores would lose their credibility. Hence, test

developers should make sure the interpretation of test results does the test-takers

justice. Also, given the washback effect of tests, the inferences we make from test

scores should be correct and appropriate to prevent unwanted social and educational

consequences (Messick, 1980, 1996). To design a valid test, it is necessary to devise

the specifications of the ability to be measured. Moreover, the interpretation and

utility of the test scores that further action is based on should also be examined to

validate a test. As Bachman (1990) notes, it is the “validity of the way we interpret

or use the information gathered through the testing procedure” that we are examining

(Cronbach, 1971; Messick, 1990). In a nutshell, the process of validation should be

based on logical, empirical and ethical considerations.

Validity is a unitary concept comprised of different types of validity. Construct

validity, first used in the field of psychological testing, has increasingly been deemed the

overarching concept of validity in recent years (Davies et al., 1999; Hughes, 2003).

A construct here means a theorized definition of the ability or skill the test aims to

measure. Construct validity refers to the degree to which a test can accurately gauge

the theoretical construct of the ability of a test subject (Anastasi & Urbina, 1997;

Bachman, 1990; Hughes, 2003). In addition, the interpretation of test results should

allow further generalization to a broader target language area (Bachman & Palmer,

1996). Under the umbrella of construct validity are the three most commonly discussed

facets of validity: content validity, criterion-related validity, and face validity. The first

two are generally considered empirical evidence for the validation of a test.

Content validity refers to whether test items can properly represent the target

language skills or structures the test is intended to measure. For instance, if a grammar

test focuses on subjects’ knowledge of present tense, then the test items should

capture all the concepts related to present tense. If present progressive is not included,

then the test cannot be said to have content validity, even if it covers all the other

concepts of present tense. To achieve content validity, developers of a test must first

specify the domain of the ability that would be covered. That is, specifications of the

skills or structures should be spelled out. Specifications of a test include content,

structure, timing, medium, techniques, criteria levels of performance, and scoring

procedures (Hughes, 2003). It is worth noting that “content” refers to the

potential content of any version of a test rather than a specific single version of it.

Empirical support for content validity can be derived by means of Factor Analysis or

LTT analysis (Davies et al., 1999).

Criterion-related validity denotes the correlation between the test in question and

a well-established, reliable assessment. If the criterion is a test administered at about

the same time, then it is concurrent validity that is being measured. If the criterion is a

test or performance assessment which will be conducted in the future, then predictive

validity is being calculated.

Concurrent validity can be established in two ways. One is to examine how a test

can discriminate among test-takers of different levels of proficiency. The other is to

compare results of the test in question to those of a standardized or well-established

test measuring the same ability or skill (Bachman, 1990). Both methods have their

own issues. The former is less used by researchers due to the difficulty in defining the

standard of proficiency to compare against. The latter requires the criterion measure to

have undergone construct validation, the only sufficient evidence for validity, before

being taken as a criterion to compare against (Messick, 1990; Thorndike, 1949). Since

it is not easy to find a valid and reliable test to serve as a criterion, it is not easy to

establish empirical evidence for concurrent validity.

As for predictive validity, it primarily focuses on how the test scores can predict

criterion behavior, or performance in the future. If a test aims to screen students for

admission into a language gifted class, its criterion would be the performance of the

students who are recruited into the class. A potential pitfall in the use of predictive

validity is that such a test might not cover an ability holistically. For instance, if a

vocabulary size test can accurately predict future performance for students in

language-talented classes, this does not mean that such a test amounts to a valid test of

overall proficiency. As Bachman (1990) points out, without empirical evidence,

“language tests developed for purposes of prediction…cannot…be interpreted as valid

measures of any particular ability.”

Face validity involves untrained observers’ perception of the test in

question. If the test looks as if it measures what it claims to assess, then it has face

validity. Although face validity has virtually no effect on the empirical establishment

of construct validity, the lack of it can cause the test-takers to take the test lightly and

consequently jeopardize the test results (Davies et al., 1999; Hughes, 2003).
