
CHAPTER TWO LITERATURE REVIEW

In this chapter, research concerning the following themes will be reviewed. First, I will give an overview of language testing research. Second, I will review studies on how students take tests and on how teachers construct tests. Third, I will review research into the relationship between teachers’ test-constructing processes and students’ test-taking processes. Finally, I will review literature concerning the technique of verbal report in language testing.

Overview of Language Testing Research

Language testing research, a well-established branch of applied linguistics, has evolved and expanded through the years. Bachman (2000), in his state-of-the-art article, chronicled the major developments of testing research in the last two decades of the 20th century and also predicted the future directions for testing research in the 21st century. To gain a general understanding of testing research as a whole and to situate the present study within the field, I will briefly review Bachman (2000) below.

According to Oller (1979, cited in Bachman, 2000), language testing research, from the mid-1960s through the 1970s, was dominated by the hypothesis that language proficiency consisted of a single unitary trait, and the research methodology used was often quantitative and statistical. Then, the 1980s saw the influence of second language acquisition (SLA) research on testing research. Research in SLA spurred language testers to investigate not only a wide range of factors affecting language test performance (e.g., Douglas & Selinker, 1985; Chapelle, 1988; Hale, 1988), but also the strategies involved in the process of test-taking itself (e.g., Cohen, 1984). It was during this period that research on the test-taking process began to emerge. Toward the end of the 1980s, language testers were challenged by Pienemann et al. (1988) to explicitly take into consideration language learners’ developmental sequence in the design of language tests and in the interpretation of test scores.

Testing research in the 1990s witnessed expansions in five major areas: (1) research methodology; (2) practical advances; (3) factors that affect performance on language tests; (4) performance assessment; and (5) ethical issues. Each of the five areas will be summarized briefly below.

Methodological approaches employed in language testing research in the 1990s became increasingly sophisticated and diverse. Newer and more powerful quantitative methods, such as criterion-referenced measurement (Lynch & Davidson, 1997), generalizability theory (Bachman, 1997), item response theory (Pollitt, 1997), and structural equation modeling (Kunnan, 1998), superseded classical norm-referenced reliability coefficients and exploratory factor analysis. Moreover, qualitative approaches were also applied to language testing research. They include expert judgments, introspective and retrospective verbal reports, observations, questionnaires and interviews, text analysis, conversational analysis, and discourse analysis (Banerjee & Luoma, 1997).

Concerning practical issues, the testing research agenda began to see advances in the areas of cross-cultural pragmatics (e.g., Hudson et al., 1992; 1995), languages for specific purposes (e.g., Douglas, 2000), and computer-based assessment (Gruba & Corbel, 1997), as well as a renaissance in research into the testing of vocabulary (e.g., Read, 2000) and the development of new kinds of vocabulary tests (e.g., Nation, 1990; Laufer & Nation, 1999).

Regarding factors affecting performance on language tests, research has mainly focused on characteristics of the testing procedure (e.g., Fulcher, 1996; Riley & Lee, 1996), characteristics of test takers (e.g., Hill, 1993), and the test-taking process (e.g., Storey, 1997). A number of the test-taking process studies have used the qualitative methodologies mentioned above, such as verbal reports, questionnaires, and discourse analysis.

“Performance” assessment (McNamara, 1997), or “alternative” or “authentic” assessment (Herman et al., 1992; Wiggins, 1993), in the 1990s was spurred largely by widespread dissatisfaction with standardized multiple-choice tests in the communicative language teaching context, and by developments in task-based language teaching and assessment (Norris et al., 1998). Performance assessment measures include classroom observation, portfolios, conferences, journals, questionnaires, interviews, self- and peer-assessment, group oral assessment, etc. (Brown, 1998).

Ethical issues in the 1990s included research into washback on instruction (e.g., Alderson & Wall, 1993; Wall & Alderson, 1993), ethics of test use (e.g., Lynch, 1997; Shohamy, 1997), and professionalization of the testing field (e.g., Stansfield, 1993), which includes two interrelated activities: professional training and a code of practice (Davies, 1997).

After reviewing the major developments of testing research in the last two decades of the 20th century, Bachman (2000) also suggested some future directions for testing research in the 21st century. He believes that “there are two areas in which language testing and language testers must continue to grow and develop: the professionalization of the field, and validation research. However, rather than being two disparate directions,…these are two virtually related areas that lie on the same path” (Bachman, 2000, p. 18).

According to Bachman (2000), the professionalization of language testing has two major thrusts: (1) the training of language testing professionals; and (2) the development of standards of practice and mechanisms for their implementation and enforcement. Bachman (2000) further argues that “we will need not only to develop standards of professional competence in language assessment, but also to become more active advocates for the inclusion of such standards in the standards for the training and certification of language teachers” (p. 20).

In regard to validation research, Bachman (2000) believes that the research in the past decade into factors and processes that affect language test performance and test scores will continue to blossom. In addition, “the debate over methodological issues has …moved from an overly simplistic view of the incompatibility of quantitative and qualitative approaches to a greater appreciation of their complementarity and of the necessity for including a range of approaches in our research agendas” (Bachman, 2000, p. 22).

In concluding his article, Bachman (2000) reiterates that the professionalization of the field and validation research will continue to be vital to language testing. Bachman (2000) believes that:

Language testing will grow as a profession in the twenty-first century to the extent that it effectively marshals the resources at its disposal to continue to vigorously investigate the validity of the inferences we make on the basis of test scores and the fairness of the uses we make of these scores. Validity and fairness are issues that are at the heart of how we define ourselves as professionals, not only as language tester, but also as applied linguists. (p. 25)

Having reviewed Bachman’s (2000) overview of testing research, I believe that the present study fits well into the future directions he identified. On the one hand, my research focus on teachers’ test-constructing processes is in line with research into “the professionalization of the field.” On the other hand, my research concern with students’ test-taking processes contributes to “validation research.” It is against this backdrop that the present study unfolds.

Studies on Students’ Test-taking Process

In this section, we will review several verbal report studies on students’ test-taking process in reading tests, the focus of the current study.

Early Attempts

Cohen (1984) is one of the early efforts to examine the test-taking process by using verbal report data. The main purpose of Cohen (1984), which described the results of five student course papers, was to discuss methods for obtaining verbal report data on L2 test-taking strategies and to report on some types of the findings obtained. The verbal report methods explored in Cohen (1984) included think-aloud and self-observation (i.e., introspection, immediate retrospection, and delayed retrospection). The data concerned EFL and ESL students’ test-taking strategies on cloze tests and multiple-choice reading comprehension tests. The results, in general, showed that not all of the students read the entire cloze passage or reading passage before answering the test items, although they were requested to do so. In terms of cloze tests, it was found that some students did not use the context to find clues for filling in the blanks, and that students would use the strategy of translation in doing cloze tests. Moreover, when not knowing how to fill in a blank, poor students would leave it blank, whereas better students would make guesses. In terms of multiple-choice tests, students reported either reading the questions first or reading just part of the article and then looking for the corresponding questions. Moreover, students would use a strategy of matching material from the passage with material in the item stem and in the alternatives. In sum, Cohen (1984) demonstrated some ways in which verbal report data can be obtained, and the paper concluded that “there is value in striving for a closer fit between how test constructors intend for their tests to be taken and how respondents actually take them” (Cohen, 1984, p. 79).

Following Cohen (1984), we will first review other verbal report studies on multiple-choice reading comprehension tests (e.g., Nevo, 1989; Anderson et al., 1991; Rupp, Ferne & Choi, 2006), and then review those on cloze tests (e.g., Storey, 1997; Sasaki, 2000; Yamashita, 2003; Moghaddam, 2010).

Studies on Multiple-choice Reading Comprehension Tests

Nevo (1989) examined students’ test-taking strategies on a multiple-choice reading comprehension test by adopting the methods of immediate introspective verbal report and retrospective report. Forty-two Hebrew-speaking students studying French participated in the study, and they were asked to complete a multiple-choice test on four reading passages (two in Hebrew and the other two in French). An innovation of Nevo (1989) is that a checklist of fifteen strategies was provided for students to facilitate their reporting of strategy use after completing each item of the test. The results showed that there was a transfer of strategies from L1 (Hebrew) to L2 (French), and that the most frequently used strategies in both languages were returning to the passage and clues in the text. It was also found that students used more strategies that did not lead to the correct answer in their L2 than in their L1. Finally, the major contribution of Nevo (1989) to the verbal report method is its demonstration that, by providing a checklist, it is possible to obtain feedback from students about their strategy use on an item-by-item basis.

Anderson et al. (1991) presented the results of an exploratory study that examined three types of information (test-taking strategies, item content, and item performance) in the investigation into the construct validity of a reading comprehension test. The participants were twenty-eight Spanish-speaking students, and they were asked to produce retrospective think-aloud protocols while taking an English reading comprehension test, which contained forty-five multiple-choice questions. The results were as follows. First, there was a statistically significant association between students’ reported strategies and the three question types determined by the test developers. Second, students’ strategy use was significantly related to item difficulty and to item discrimination. More specifically, five strategies were worthy of note in the study. First, the strategy stating failure to understand occurred more frequently on inference test items, was used fewer times on easy items, and was used more times on items that discriminated well among those students who scored high on the test. Second, paraphrasing occurred more frequently on items asking students to identify the direct statement of the passage, and was used more times on items classified as acceptable in terms of discrimination. Third, guessing was reported more times on inference items, fewer times on easy items, and occurred about as often on acceptable and rejected items in terms of item discrimination. Fourth, matching the stem with a previous portion of the text was reported fewer times on items directed at identifying the main idea, reported fewer times on easy items, and reported fewer times on acceptable items. Fifth, making references to time allocations was reported fewer times on inference questions and more times on acceptable items. This study has thus shown the value of test-taking protocols, along with other data sources, in the investigation of the construct validity of a reading comprehension test.

Rupp, Ferne, and Choi (2006) examined test-takers’ use of strategies on a multiple-choice reading comprehension test, with the aim of investigating the equivalence of reading processes and strategy use in testing and non-testing reading conditions. The participants were ten adult ESL learners, who were first asked to verbally report their test-taking process in a semi-structured interview and then asked to think aloud concurrently. The results showed that reading processes in a test condition were strikingly different from those in a non-testing context. Moreover, the construct of reading comprehension was shown to be assessment specific and was fundamentally determined through item design and text selection. In terms of learner strategies, the study presented three findings. First, learners viewed responding to multiple-choice questions as a problem-solving task rather than a comprehension task. Second, learners selected a variety of unconditional and conditional response strategies to deliberately select choices. Third, learners combined a variety of mental resources interactively when determining an appropriate choice. In sum, the authors concluded that their findings support the development of response process models that are specific to different item types, the design of further experimental studies of test method effects on response processes, and the development of questionnaires that profile response processes and strategies specific to different item types.

Studies on Cloze Tests

Storey (1997), employing the methods of concurrent think-aloud and immediate retrospection, investigated twenty-five Hong Kong EFL students’ test-taking process in a 13-item, multiple-choice, discourse cloze test. The purpose of the study was to provide introspective validation of the testing technique and the test items by assessing observed test-taking behavior against a predicted model of ideal performance. The results revealed that different items entailed varying degrees of construct validity. Some students were found to have used theoretically expected reading processes, while others merely considered information at the within-sentence level. Although there was a mismatch between the theoretically assumed processes and the actual processes applied by some test-takers (such as use of the strategies of elimination and surface matching), the items were capable of generating construct-relevant processing, and the test was judged to have a good degree of construct validity.

Sasaki (2000) investigated how content schemata activated by culturally familiar words might have influenced students’ test-taking processes in a cloze test. Sixty Japanese EFL students were divided into two groups, each completing either a culturally familiar or an unfamiliar version of a cloze test. The participants were asked to produce immediate retrospective protocols while taking the test, and then to recall the passage after they had completed the whole test. The results showed that those who read the culturally familiar cloze text tried to solve more items and generally understood the text better, which resulted in better performances than those of the students who read the unfamiliar text. The paper concluded that it had demonstrated the merits of using multiple data sources for investigating students’ test-taking processes, and that the results also support the claim that cloze tests can measure higher-order processing abilities.

Replicating Sasaki’s (2000) experiment, Moghaddam (2010) examined the effects of cultural schemata on Iranian students’ test-taking processes in a cloze test. The participants were 116 Iranian university students, who were divided into two groups, each completing either a culturally familiar or a culturally unfamiliar version of a cloze test. They were asked to produce retrospective protocols of their test-taking process and recalls of the cloze passage. Similar to the findings of Sasaki (2000), the results of Moghaddam (2010) showed that students who read the culturally familiar cloze text generally understood the text to a greater extent and scored higher than those who read the unfamiliar text. Both Sasaki (2000) and Moghaddam (2010) suggested that cultural schemata have a certain effect on students’ test-taking processes in cloze tests.

Yamashita (2003) compared skilled and less skilled readers in their processes of taking a gap-filling cloze test. Twelve Japanese EFL students (six skilled and six less skilled) were required to complete a 16-item gap-filling test while thinking aloud about their test-taking processes; afterward, they were interviewed informally by the researcher. The results demonstrated that both skilled and less skilled students used text-level information more frequently than other types of information (such as clause-level, sentence-level, and extra-textual information). However, the skilled readers used text-level information more frequently than the less skilled readers. In sum, the gap-filling test generated processes that made readers utilize text-level constraints, and overall differentiated well between skilled and less skilled readers.

We have so far reviewed several studies on students’ test-taking processes or strategies. Although those studies were conducted for different purposes (e.g., validation, comparison of L1 and L2, comparison of skilled and less skilled readers, or the effects of cultural schemata), they all employed the method of verbal report in their experiments. Clearly, verbal report has been widely used as a means of collecting qualitative data. As Sasaki (2000) aptly comments, “The product- and process-oriented data complemented each other, providing insights that could not have been gained in the absence of one or the other” (p. 107). In the present study, I will also use verbal report to examine Taiwanese students’ test-taking process in a reading test.

Studies on Teachers’ Test Construction

“Classroom teachers are in the front line of introducing students to formal learning, including assessment” (Leighton et al., 2010, p. 7). That is, the first tests students take in class are usually made by their teachers, and it is also their classroom teachers who prepare them for formal, large-scale tests. Therefore, teachers’ “assessment literacy” (Stiggins, 1991) is very important. According to Stiggins (1991), “teacher assessment literacy [emphasis in the original] is characterized by understanding what it takes to produce high-quality achievement data for both classroom and large-scale tests, scrutinizing achievement data and not accepting it at face value, and being sufficiently confident to ask questions about technical information and complicated summaries of test scores” (Leighton et al., 2010, p. 9).

Although assessment literacy is important, many teachers unfortunately do not seem to have a solid grounding in the basic knowledge of assessment principles or practices (Leighton et al., 2010). Many educators have also noted that, for teachers, producing good tests is a demanding task (Davidson & Lynch, 2002). The general inadequacies of teachers’ test-constructing knowledge and skills are evident in the studies reviewed below.

Training in Teachers’ Test Construction

To begin with, Carter (1984), almost three decades ago, investigated teachers’ competence in test item development by asking them to identify and write specific items intended to measure particular reading skills (main idea, detail, inference, and prediction), and by interviewing them about their test-constructing perceptions and processes. The results showed that teachers had more difficulty in identifying and developing items tapping higher-level reading skills (i.e., inference and prediction) than in identifying and writing items to test lower-level cognitive skills (i.e., main idea and detail). The interview data suggested that teachers felt insecure about their knowledge of basic principles for item writing and that they might possess a limited repertoire of test-constructing skills. Based on these results, Carter (1984) argued for an emphasis on testing courses in preservice and inservice teacher education.

To equip teachers with test-construction principles and to improve their test-writing skills, many teacher education programs began to include language testing courses. Kathleen M. Bailey and James D. Brown have reported the results of two questionnaire surveys (Bailey & Brown, 1996; Brown & Bailey, 2008) of instructors of language testing courses worldwide. They found that the contents of the language testing courses were quite diversified, covering topics such as hands-on experiences, general topics, item analysis, descriptive statistics, test consistency, and
