Heaton (1990) proposed that the two basic principles for a test are validity and
reliability. According to Bachman and Palmer (1996), the most important quality of a
test is its usefulness, which is determined largely by how test-takers perform on the
test. How test-takers respond to test items, and the outcome of those responses, is
the main concern in evaluating test usefulness. Among the
qualities determining the usefulness of a test, reliability and construct validity are of
primary concern (Bachman, 2004).
Hughes (2003) set out the basic criteria against which a test or testing system should
be evaluated. Such a test:
consistently provides accurate measures of precisely the abilities in which we are interested;
has a beneficial effect on teaching;
is economical in terms of time and money.
Still, the priority is to make sure that the abilities being measured are what we
intend to measure. In the following section, the concepts of reliability and validity
will be reviewed.
Reliability
Reliability represents the consistency of measurement (Bachman & Palmer,
1996). Theoretically speaking, if a subject takes a test twice, he or she should get
identical scores. Also, if a group of test-takers take a test twice and are ranked according to
their performances, their ranking for the two tests should be exactly the same.
However, in reality, such an ideal is not easy to achieve since many factors other than
the ability to be measured can stand in the way of obtaining a reliable test result. The
factors interfering with test scores include test method facets, attributes of the
test-taker that are not considered part of the target language ability, and unpredictable,
temporary factors (Bachman, 1990). For example, if the items in a vocabulary test are
too long, the test-takers might get items wrong simply because they are too fatigued to
pay full attention to the content of the test. Emotional or physical conditions might
also prevent the test-takers from performing as well as they should. Temporary factors such
as the administrative process, and external conditions like noise or temperature, might
also tamper with the test results. These factors lead to inconsistency of test results, or
measurement error. Therefore, it is of paramount importance to minimize the effect
such factors have on the representativeness of the test score so that a reliable test
result can be obtained. The lower the degree of fluctuation the test results show, the
more reliable a test is. In other words, if measurement error can be minimized,
reliability can be ensured (Hughes, 2003).
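The link between measurement error and reliability can be written out under the
classical true-score model. This is a sketch that assumes an observed score X is the sum
of a true score T and an error E uncorrelated with T; the symbols are introduced here for
illustration and are not taken from the sources cited above.

```latex
\[
\begin{aligned}
X &= T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2, \\
\rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}.
\end{aligned}
\]
```

Under these assumptions, the reliability coefficient approaches 1 as the error variance
shrinks relative to the observed-score variance.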
Reliability can be empirically tested by means of Classical Test Theory. One
approach is called the test-retest method. Subjects are asked to take the same test
twice, and the two sets of test scores are correlated to obtain a reliability coefficient.
However, this method is difficult to put into practice. If a subject takes the two tests at too
short an interval, he or she might be able to recall the answers. If the two tests are
administered at too long an interval, practice effect may get in the way (Davies et al.,
1999; Schmitt, 2010).
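As an illustration, a test-retest reliability coefficient can be estimated as the Pearson
correlation between the two sets of scores. The sketch below assumes that this is the
coefficient of interest; the helper name `pearson_r` and the scores are hypothetical and
serve only to show the calculation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores of five test-takers on two administrations of the same test.
first_administration = [12, 15, 9, 20, 17]
second_administration = [13, 14, 10, 19, 18]

test_retest_reliability = pearson_r(first_administration, second_administration)
print(f"test-retest reliability: {test_retest_reliability:.3f}")
```

A coefficient close to 1 indicates that the test-takers kept nearly the same relative
standing across the two administrations.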
The other method, the split-half method, is much more popular among
researchers since it requires only one administration. Instead of administering
the same test twice, a test is split into two parallel halves, which are analyzed for a coefficient
of internal consistency. Yet, the prerequisite that the two halves of the test should be
proven to be equivalent is difficult to satisfy (Hughes, 2003; N. Schmitt, Schmitt,
& Clapham, 2001). In fact, the theoretical assumption of reliability as “the correlation
between test scores on two parallel tests” is difficult to operationalize (Hambleton,
Swaminathan, & Rogers, 1991).
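A minimal sketch of a split-half estimate follows. It assumes an odd-even split of
dichotomously scored items and applies the Spearman-Brown formula to step the half-test
correlation up to full test length; both choices are illustrative assumptions rather than
procedures prescribed by the sources above, and the response matrix and function names
are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)


# Hypothetical responses: one row per test-taker, one column per item (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

# Odd-even split: items at positions 1, 3, 5 versus items at positions 2, 4, 6.
first_half_scores = [sum(row[0::2]) for row in responses]
second_half_scores = [sum(row[1::2]) for row in responses]

half_correlation = pearson_r(first_half_scores, second_half_scores)
split_half_reliability = spearman_brown(half_correlation)
print(f"half-test r = {half_correlation:.3f}, "
      f"stepped-up reliability = {split_half_reliability:.3f}")
```

The estimate is only as trustworthy as the assumption that the two halves are equivalent,
which is the difficulty noted above.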
In view of the limitations of Classical Test Theory in accurately gauging reliability,
Item Response Theory, also known as Latent Trait Theory (LTT) and capable of estimating
the standard error of measurement across ability levels and test-takers, has been adopted
to obtain a more appropriate estimate of reliability (Schmitt, 2010). The theoretical
construct of LTT and how it achieves a more accurate estimate of reliability will be
discussed in detail in the section on LTT.
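To illustrate what estimating the standard error of measurement across ability levels
means, the sketch below uses the one-parameter (Rasch) model, chosen here purely as an
assumption for illustration. Under that model the information an item contributes at
ability theta is p(1 - p), and the standard error is the reciprocal of the square root of
the summed information. The item difficulties and function names are hypothetical.

```python
from math import exp, sqrt


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))


def information_at(theta, difficulties):
    """Sum of item information p * (1 - p) over all items at ability theta."""
    total = 0.0
    for difficulty in difficulties:
        p = rasch_probability(theta, difficulty)
        total += p * (1.0 - p)
    return total


def standard_error(theta, difficulties):
    """Standard error of the ability estimate at theta: 1 / sqrt(information)."""
    return 1.0 / sqrt(information_at(theta, difficulties))


# Hypothetical item difficulties (in logits), centered on average ability.
item_difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

sem_by_ability = {
    theta: standard_error(theta, item_difficulties)
    for theta in (-3.0, 0.0, 3.0)
}
for theta, sem in sem_by_ability.items():
    print(f"theta = {theta:+.1f}: standard error = {sem:.2f}")
```

In this sketch the standard error is smallest where the item difficulties match the
test-taker's ability and grows toward the extremes, which a single test-wide reliability
coefficient cannot show.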
Validity
If a test exhibits validity, it measures the ability or skill the test is intended to
measure. Every test has its own purpose or use, be it for placement or for
assessment. If the test result does not reflect the ability the test claims to measure, the
decisions made based on the test scores would lose their credibility. Hence, test
developers should make sure the interpretation of test results does the test-takers
justice. Also, given the washback effect of tests, the inferences we make from test
scores should be correct and appropriate to prevent unwanted social and educational
consequences (Messick, 1980, 1996). To design a valid test, it is necessary to devise
the specifications of the ability to be measured. Moreover, the interpretation and
utility of the test scores that further action is based on should also be examined to
validate a test. As Bachman (1990) notes, it is the “validity of the way we interpret
or use the information gathered through the testing procedure” that we are examining
(Cronbach, 1971; Messick, 1990). In a nutshell, the process of validation should be
based on logical, empirical and ethical considerations.
Validity is a unitary concept comprising different types of validity. Construct
validity, first used in the field of psychological testing, has increasingly been deemed the
overarching concept of validity in recent years (Davies et al., 1999; Hughes, 2003).
Construct here means a theorized definition of the ability or skill the test aims to
measure. Construct validity refers to the degree to which a test can accurately gauge
the theoretical construct of the ability of a test subject (Anastasi & Urbina, 1997;
Bachman, 1990; Hughes, 2003). In addition, the interpretation of test results should
allow further generalization to a bigger target language area (Bachman & Palmer,
1996). Under the umbrella of construct validity are the three most commonly discussed
facets of validity: content validity, criterion-related validity, and face validity. The first
two are generally considered empirical evidence for the validation of a test.
Content validity refers to whether test items can properly represent the target
language skills or structures the test is intended to measure. For instance, if a grammar
test focuses on subjects’ knowledge of present tense, then the test items should
capture all the concepts related to present tense. If present progressive is not included,
then the test cannot be said to have content validity, even if it covers all the other
concepts of present tense. To achieve content validity, developers of a test must first
specify the domain of the ability that would be covered. That is, specifications of the
skills or structures should be spelled out. Specifications of a test include content,
structure, timing, medium, techniques, criteria levels of performance, and scoring
procedures (Hughes, 2003). It is worth noting that “content” refers to the
potential content of any version of a test rather than a specific single version of it.
Empirical support for content validity can be derived by means of factor analysis or
LTT analysis (Davies et al., 1999).
Criterion-related validity denotes the correlation between the test in question and
a well-established, reliable assessment. If the criterion is a test administered at about
the same time, then it is concurrent validity that is being measured. If the criterion is a
test or performance assessment which will be conducted in the future, then predictive
validity is being calculated.
Concurrent validity can be established in two ways. One is to examine how a test
can discriminate among test-takers of different levels of proficiency. The other is to
compare results of the test in question to those of a standardized or well-established
test measuring the same ability or skill (Bachman, 1990). Both methods have their
own issues. The former is less used by researchers due to the difficulty in defining the
standard of proficiency to compare against. The latter requires the criterion measure to
have undergone construct validation, the only sufficient evidence for validity, before
being taken as a criterion to compare against (Messick, 1990; Thorndike, 1949). Since
a valid and reliable test that can serve as a criterion is hard to find, empirical evidence
for concurrent validity is difficult to establish.
As for predictive validity, it primarily focuses on how the test scores can predict
criterion behavior, or performance in the future. If a test aims to screen students for
admission into a language-gifted class, its criterion would be the performance of the
students who are admitted into the class. A potential pitfall in the use of predictive
validity is that such a test might not cover an ability holistically. For instance, if a
vocabulary size test can accurately predict future performance for students in
language-gifted classes, it does not follow that such a test amounts to a valid test of
overall proficiency. As Bachman (1990) points out, without empirical evidence,
“language tests developed for purposes of prediction…cannot…be interpreted as valid
measures of any particular ability.”
Face validity involves the perception of the test in question by untrained observers.
If the test looks as if it measures what it claims to assess, then it has face
validity. Although face validity has virtually no effect on the empirical establishment
of construct validity, the lack of it can cause the test-takers to take the test lightly and
consequently jeopardize the test results (Davies et al., 1999; Hughes, 2003).