Validity issues: accuracy; quality control; timeliness; meaningfulness; misuse issues; challenges; retakes
11. Item banking: security issues; usefulness and flexibility; principles for effective item banking
12. Test technical report: systematic, thorough, detailed documentation of validity evidence; 12-step organization; recommendations
In Downing’s framework, purposes are crucial to the development of an overall
plan for the test. They are critical in determining test content, in selecting the format
of test items, and in guiding the evaluation of test uses and interpretations of scores.
Once the purposes of the test are clarified, the test content should then be precisely
specified. Clear content specifications “specify the number or proportion of items that
assess each content and process/skill area; the format of items, responses, and scoring
rubrics and procedures; and the desired psychometric properties of the items and test
such as the distribution of item difficulty and discrimination indices” (AERA, APA, & NCME, 1999, p. 183).
As Downing notes, appropriate identification of test content facilitates effective
test items, which are designed to “measure important content at an appropriate
cognitive level.” Downing emphasizes the importance of effective item creation and
describes it as “more art than science.” In many large-scale high-stakes tests, the
multiple choice format is widely used because multiple choice items can be
administered in a short time and the test taker responses can be objectively and
efficiently scored. Despite research evidence for the principles of writing effective
multiple choice items (Alderson, Clapham, & Wall, 1995; Haladyna, 2004; Haladyna
& Downing, 1989, 2004; Haladyna, Downing, & Rodriguez, 2002), the creation of
effective test items remains a challenge for test developers. As Downing states:
…The principles of writing effective, objectively scored
multiple-choice items are well-established and many of these principles have a solid basis in the research literature… Yet, knowing the
principles of effective item writing is no guarantee of an item writer’s ability to actually produce effective test questions. Knowing is not necessarily doing. Thus, one of the more important validity issues associated with test development concerns the selection and training of item writers…The most essential characteristic of an effective item writer is content expertise…many other item writer characteristics such as regional geographic balance, content subspecialization, and racial, ethnic, and gender balance must be considered in the selection of item writers.
(Downing, 2006, p. 11)
Weir’s Framework for Language Test Development
Based on current developments in theory and practice, Weir (2005) presents a
coherent “evidence-based validity” framework for language test design and
implementation, which involves providing evidence relating to context validity,
theory-based validity, criterion-based validity, scoring validity, and consequential
validity. According to Weir, the framework is specifically designed for testing English for Speakers of Other Languages (ESOL) but also applies to all forms of educational assessment. In the construction of tests, test developers must provide all five
types of validity evidence to “justify the correctness of our interpretations of abilities
from test scores” (p. 2). Among the five types of evidence, context validity and
theory-based validity evidence, which are collected before the test event, are
concerned with what abilities a test is intended to measure and how the choice of tasks
in a test is representative of the abilities required in “real life language use.” The other
three types of validity evidence (i.e., scoring validity, criterion-based validity, and
consequential validity), which are generated after the test has been administered, are
concerned with the reliability of test scores, the extent to which test scores correlate
with external criteria of real life performance, and the consequences of test use for test
stakeholders: learners, teachers, parents, government and official bodies, and the
marketplace.
Weir further describes in detail a socio-cognitive framework specifically
designed for validating reading tests, which is presented as a flowchart of boxes, from
test taker characteristics, theory-based validity, and context validity, to scoring validity,
consequential validity, and criterion-related validity. As shown in Figure 1, Weir’s
framework provides us with insights into what type of evidence can be collected at
different stages of reading test construction and how the different types of validity
evidence fit together.
Figure 1. A Socio-cognitive Framework for Validating Reading Skills
From Language testing and validation: An evidence-based approach (p.44), by C. J.
Weir, 2005. New York: Palgrave Macmillan.
Research in Second Language Reading
This section provides an overview of research in second language reading. We
will begin with theoretical accounts of reading comprehension and reading strategies,
which pave the way for a later review of research in second language reading.
Reading Comprehension and Strategy Use
Reading comprehension has been discussed from a number of perspectives. In a
review of reading comprehension research, Pressley (2000) concludes that
comprehension depends on a number of lower order processes (e.g., skilled decoding
of words) and higher order processes (e.g., relating text content to background
knowledge; use of comprehension strategies). As Pressley notes, reading
comprehension “begins with decoding of words, processing of those words in relation
to one another to understand the many small ideas in the text, and then, both
unconsciously and consciously, operating on the ideas in the text to construct the
overall meaning encoded in the text” (p. 551). Along with previous research, Pressley
confirms that accurate and fluent (automatic) word recognition is a prerequisite for
reading comprehension (Carver, 1997; LaBerge & Samuels, 1974; Perfetti, 1997;
Pressley, 1998, 2000). During the process of reading, language comprehension
processes interact with higher-level processes. Readers may automatically relate text
content to prior knowledge and/or consciously activate comprehension strategies.
When readers’ activation of schematic knowledge is relevant to the information in the
text, reading is successful. While good readers typically make inferences based on
prior knowledge directly relevant to the ideas in the text, poor readers make
“unwarranted and unnecessary” inferences by drawing on prior knowledge not
directly relevant to the most important ideas in the text (Anderson & Pearson, 1984;
Hudson, 1990; Hudson & Nelson, 1983; Rosenblatt, 1978; Williams, 1993).
In terms of strategy use, good readers use a variety of strategies, including
being aware of reading purposes, overviewing the text, reading selectively, making
associations, evaluating and revising hypotheses, revising prior knowledge, figuring
out unknown words in text, underlining and making notes, interpreting text,
evaluating the text, reviewing the text, and using the information in the text (Pressley
& Afflerbach, 1995). Given the importance of both lower order and higher order
processes in reading comprehension, Pressley further suggests that teachers promote
learners’ comprehension abilities by improving word-level competences, building
background knowledge, and promoting use of comprehension strategies.
In developing a proposed research agenda for reading comprehension, the
RAND Reading Study Group (2002) defines reading comprehension as “the process
of simultaneously extracting and constructing meaning through interaction and
involvement with written language” (p. 11). According to the proposal, three key
elements are essential in reading comprehension: the reader, the text, and the activity
(e.g., purpose for reading, processes while reading, and consequences of reading). In
reading comprehension, the three elements are interrelated within a larger sociocultural context that interacts with each of them, as illustrated in
Figure 2.
Figure 2. A Heuristic for Thinking about Reading Comprehension
From Reading for understanding: Toward a R&D program in reading comprehension (p.12), by RAND Reading Study Group, 2002. Santa Monica, CA: Science and Technology Policy Institute, RAND Education.
In this framework, good readers have a wide range of capacities and abilities,
including cognitive capacities (e.g., attention, memory, critical analytic ability,
inferencing, visualization ability), motivation (e.g., a purpose for reading, an interest
in the content being read, self-efficacy as a reader), and different types of knowledge
(e.g., vocabulary, domain and topic knowledge, linguistic and discourse knowledge,
knowledge of specific comprehension strategies). Before reading, readers have
purposes in mind. While reading, they process the text with regard to the purposes.
They construct various representations of the text that are important for
comprehension, including the surface code (e.g., the exact wording of the text), the
text base (e.g., idea units representing the meaning), and a representation of mental
models embedded in the text. Reading activities may have direct consequences in
knowledge, application, and engagement, or other long-term consequences. In reading comprehension, all three key elements (the reader, the text, and the activity) are interrelated within a sociocultural context.
Issues in Second Language Reading
In the context of second language reading, research has stressed the interactive
nature of bottom-up and top-down processing (Bernhardt, 1991; Carrell, Devine &
Eskey, 1988). While reading, readers engage in both bottom-up and top-down
processing. In bottom-up processing, readers “begin with the printed words, recognize
graphic stimuli, decode them to sound, recognize words and decode meaning.” In
top-down processing, readers “activate what they consider to be relevant existing
schemata, and map incoming information onto them” (Alderson, 2000, pp. 16-17).
Drawing on an extensive review of research in reading comprehension, Alderson
concludes that bottom-up and top-down approaches are both important in reading and
“the balance between the two approaches is likely to vary with text, reader, and
purpose” (p. 20). According to Alderson, variables that affect the nature of reading are
mainly “the interaction between reader and text variables in the process of reading”
(p. 32). Reader variables include schemata and background knowledge, knowledge of
language, knowledge of genre/text type, metalinguistic knowledge and metacognition,
content schemata, knowledge of subject matter/topic, knowledge of the world, cultural
knowledge, reader skills and abilities, reader motivation, reader affect, etc. Text
variables include text topic and content, text type and genre, text organization,
linguistic variables, text readability, typographical features, the medium of text
presentation, etc.
Bernhardt and Kamil (1995) provide a thorough review of research and claim that
second language reading is an interaction of L1 reading ability and L2 linguistic
knowledge (e.g., word knowledge and syntax). While L1 literacy accounts for 20% of the variance in L2 reading ability, L2 linguistic ability accounts for 30% of the variance (27% from word knowledge and 3% from syntax). A number of studies have
confirmed the contribution of L1 to L2 reading development (Grabe, 2009; Guthrie,
1988; Koda, 2005; Rutherford, 1983). Koda (2005) argues that L1 processing
experience has influence on the development of L2 reading skills. Grabe (2009) also
suggests that L1 reading abilities such as metalinguistic awareness and basic cognitive
skills are likely to transfer to L2 reading contexts.
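Bernhardt and Kamil's variance figures can be restated as a simple sum (a back-of-envelope summary of the percentages cited above, not a formula taken from the original study):

```latex
\underbrace{0.20}_{\text{L1 literacy}} \;+\; \underbrace{0.30}_{\substack{\text{L2 linguistic knowledge}\\ (0.27\ \text{word knowledge}\,+\,0.03\ \text{syntax})}} \;=\; 0.50
```

On this arithmetic, roughly half of the variance in L2 reading ability remains unaccounted for by L1 literacy and L2 linguistic knowledge combined.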
Alderson (1984, 2000) concludes from a number of studies that both L2
language knowledge and L1 reading knowledge are important factors in second
language reading, with L2 language knowledge being a more powerful factor than L1
reading ability. He also confirms that a “linguistic threshold” exists and that L2
learners transfer their L1 reading ability to L2 reading contexts only when they reach
a certain proficiency level. In other words, less proficient L2 learners need to improve
their linguistic knowledge so as to engage in L2 reading. Alderson (2000)
further suggests that learners’ linguistic threshold varies with task: “the more
demanding the task, the higher the linguistic threshold.”
Research in second language reading has shown that fluent word recognition,
processing efficiency, and reading rate are vital in reading comprehension (Alderson,
2000; Bernhardt, 1991, 2000; Grabe, 1991; Grabe & Stoller, 2002; Koda, 1996, 1997).
Insufficient linguistic knowledge constrains second language reading processes
(Alderson, 2000; Bernhardt, 1991, 2000; Brisbois, 1995). Vocabulary difficulty has
consistently been shown to have an effect on comprehension for L1 and L2 readers
(Alderson, 2000; Carver, 1994; Freebody & Anderson, 1983; Hu & Nation, 2000;
Laufer, 1992; Nation, 1990, 2001; Read, 2000; Williams & Dallas, 1984). Brisbois
(1995) argues that L2 knowledge is critical in reading comprehension, especially
among learners at the beginning levels. Other studies also suggest that insufficient
vocabulary hinders L2 reading performance (Hu & Nation, 2000; Segalowitz, 1986;
Segalowitz, Poulsen, & Komoda, 1991), and that lower-level processing predominates in the reading process among beginning L2 learners (Clarke, 1979; Horiba, 1993).
Meanwhile, reading speed is related to fluency: with increased L2 proficiency, reading
rate improves (Favreau & Segalowitz, 1982; Haynes & Carr, 1990), and error rate
decreases (Bernhardt, 1991).
Research in Second Language Reading Assessment
This section provides an overview of research in second language reading
assessment. We will first highlight the model proposed by Urquhart and Weir (1998)
and Weir (2005) in the construct of second language reading tests. Next, we will
address issues in assessment, including the use of verbal report in assessment,
individual differences in strategy use, and item difficulty in reading assessment.
Construct of Second Language Reading Tests
In the context of second language reading assessment, how reading ability is
defined affects the construct of a test. One prevalent perspective is to view reading as
a set of comprehension processes (Alderson, 2000; Grabe, 1991, 1999, 2000; Grabe & Stoller, 2002; Urquhart & Weir, 1998; Weir, 2005) that can be broken down into
reading skills and strategies needed for testing purposes (Urquhart & Weir, 1998; Weir,
2005). Based on Urquhart and Weir’s model (1998), Weir (2005) develops a model of
the reading process, as presented in Figure 3, to account for four types of reading, as
shown in Table 2.
Figure 3. Urquhart and Weir’s Model of the Reading Process
From Language testing and validation: An evidence-based approach (p. 92), by C. J.
Weir, 2005. New York: Palgrave Macmillan. Adapted from Reading in a second language: Process, product and practice (p. 106), by A. H. Urquhart & C. J. Weir, 1998. Harlow: Longman.
Table 2. Types of Reading
- Discourse topic and main ideas, or structure of text, or relevance to needs.
- Search reading to locate quickly and understand information relevant to predetermined needs.
- Scanning to locate specific points of information.
Note. From Language testing and validation: An evidence-based approach (p. 90), by C. J. Weir, 2005. New York: Palgrave Macmillan. Adapted from Reading in a second language: Process, product and practice (p. 123), by A. H. Urquhart & C. J. Weir, 1998. Harlow: Longman.
In this model, Goalsetter and Monitor, which are metacognitive mechanisms,
“mediate among different processing skills and knowledge sources available to a
reader” and “enable a reader to activate different levels of strategies and skills to cope
with different reading purposes” (Weir, 2005, pp. 95-96). Once the test takers have clear
purposes for reading, they choose the most appropriate strategies in response to the
task demand. The higher the demand of a task, the more components of the model are involved (Urquhart & Weir, 1998; Weir, 2005).
As illustrated in Table 2, the process of reading involves the use of different
skills and strategies. According to Urquhart and Weir (1998) and Weir (2005), reading comprehension can be either global or local. Global reading is comprehension beyond the
sentence level such as reading for main idea or important details, whereas local
reading is comprehension within the sentence level, such as reading for word meaning
or pronominal reference. In a reading test, the demand of a careful reading item at the
global level is usually higher than that of a scanning item since the former requires the
test taker to go through the whole text and activate all components of the model, while the latter might involve just a few components. Weir (2005) further points out
that test developers should consider the appropriateness of different questions and
reading strategies for different types of texts. In a scanning test, for example, the text
should provide sufficient and varied specific details for readers. In a careful reading
test, the text should include enough main ideas or important points. In an inferencing
test, the text should include pieces of information that can be linked together. In a
skimming or search reading test, the text should have a clear organization and provide
explicit ideas at the surface level.
Verbal Report in Assessment
Research in reading comprehension assessment has consistently recognized the
importance of investigating the examinees’ cognitive processing, thought process, and
strategy use through verbal report measures as part of the process of test validation
(Afflerbach, 2007; Anderson, 1991; Anderson, Bachman, Perkins, & Cohen, 1991;
Cheng, Fox, & Zheng, 2007; Cohen, 1984, 1988, 2000; Cohen & Upton, 2006, 2007;
Ericsson & Simon, 1993; Gass & Mackey, 2000; Green, 1998; Perkins, 1992; Phakiti,
2003; Pressley & Afflerbach, 1995; Urquhart & Weir, 1998; Weir, 2005; Weir, Yang,
& Jin, 2000). Green (1998) defines verbal reports or verbal protocols as “the data
gathered from an individual under special conditions, where the person is asked to
either think aloud or to talk aloud” (p. 1). According to Green, verbal protocols may
be gathered concurrently (i.e., while the task is carried out) or retrospectively (i.e., after the task has been carried out). In either concurrent or
retrospective verbal reports, the prompts given to the individual can be non-mediated (e.g., requests such as ‘keep talking’) or mediated (e.g., requests for explanations or
justifications). Green provides a comprehensive and in-depth overview of the use of
verbal protocols in language assessment and concludes that verbal protocol analysis
has the potential to “elucidate the abilities that need to be measured, and also to
provide a means for identifying relevant test methods and selecting appropriate test
content” (p. 120).
Verbal protocol analysis is widely used to probe into the examinees’ use of
reading and test taking strategies during the test (Alderson, 2005; Anderson, Bachman,
Perkins, & Cohen, 1991; Cheng, Fox, & Zheng, 2007; Urquhart & Weir, 1998; Weir,
2005; Weir, Yang, & Jin, 2000). As Ellis (2004) states, “collecting verbal
explanations…would appear, on the face of it, to provide the most valid measure of a
learner’s explicit knowledge” (p. 263). Weir (2005) suggests that test developers and
teachers use verbal report measures to investigate the examinees’ mental process
while taking a test. The analysis of verbal reports allows test developers and teachers
to: (1) evaluate whether the test measures what it is intended to measure; and (2) compare the use of reading skills and strategies between good and poor readers.
Individual Differences in Strategy Use
In a review of reading comprehension research, Perfetti (1997) claims that
research on individual differences among readers is crucial to understanding the
nature of reading abilities. In other words, if we want to understand the nature of
reading comprehension, we need to know the sources of individual differences
between good and poor readers. Perfetti suggests that good readers differ from poor
readers in the following aspects: processing efficiencies (e.g., speed and automaticity of word recognition), word knowledge, processing efficiencies in working memory,
fluency in syntactic parsing and proposition integration, and the development of an
accurate and reasonably complete text model of comprehension.
In the evaluation of second language reading assessment, Weir (2005) claims
that when proficient readers process different reading tasks (e.g., skimming, scanning,
search reading, careful reading) through skills and strategies appropriate to the
purposes of the tasks, then the test measures what it is intended to measure and is
valid in terms of theory-based validity. Conversely, if examinees successfully
process the tasks through test taking strategies instead of applying appropriate reading
skills and strategies, then the test does not measure what it is intended to measure and
provides weak evidence for theory-based validity. According to Weir, typical test
taking strategies include: (1) matching words in the question with the same words in
the text; (2) using clues in other questions to answer the question under consideration;
(3) using prior knowledge to answer the questions; (4) blind guessing not based on
any particular rationale (p. 94).
Pressley and Afflerbach (1995) classify reading strategies into three types:
planning and identifying strategies, by which readers construct the text meaning;
monitoring strategies, by which readers regulate comprehension and learning; and
evaluating strategies, by which readers reflect or respond to the text. Research in
second language learning has shown that L2 readers use this same range of strategies to
comprehend, interpret, and evaluate texts (Carrell & Grabe, 2002; Cohen & Upton,
2007; Upton, Lee-Thompson, & Li-Chun, 2001).
Extensive studies have demonstrated that readers use their prior knowledge to
determine the importance of information in the text and make inferences about the
text. While good readers typically make inferences based on prior knowledge directly
relevant to the ideas in the text, poor readers make inferences by drawing on prior