In this section, the relationship between teaching and testing is discussed, and literature on using multiple-choice questions to measure reading skills is reviewed.

Why Testing (The Relationship between Teaching and Testing)

Testing is an important part of every teaching and learning experience (Madsen, 1983). The proper relationship between teaching and testing is surely that of partnership (Hughes, 2003). Testing should not only follow teaching but should also support good teaching and exert a corrective influence on bad teaching. Therefore, an understanding of tests can indicate whether our teaching needs improvement overall or whether certain parts of it need more attention. The importance of testing has been emphasized by a number of researchers. As McNamara (2000) pointed out, language tests can serve as a valuable tool for teachers by providing information relevant to language learning, such as (1) evidence of the results of learning and instruction as well as feedback on the effectiveness of the teaching program, (2) information relevant to decision-making about individuals, and (3) information that helps teachers clarify instructional objectives. Tests can thus provide information about teaching and learning and be helpful to both teachers and students.

According to Madsen (1983), well-made tests can help students develop positive attitudes toward instruction by giving them a sense of accomplishment and a feeling that the teacher’s evaluation of them corresponds to what has been taught. Tests also help foster learning, since they allow teachers to confirm which parts of the material each student has mastered and which parts need further attention and improvement. What is more, good English tests can assist students in learning the language by requiring them to study hard, emphasizing course objectives, and showing them where they need to improve.

The qualities desirable for a good test are validity, reliability, and practicality (Bachman, 1990; Brown, 2001; Harris, 1969; Hughes, 2003). These qualities are generally regarded as the basic requirements of a good test. Validity is considered the most complex criterion of a good test (Brown, 2003). A test is said to have validity when it actually measures what it intends to measure. For example, if the purpose of a reading test is to examine the reading ability of a particular group of students, the test results should reflect their true reading ability. Validity is an important factor in designing good reading comprehension tests (Sequera, 1995).

Three important types of test validity are content validity, face validity, and construct validity. The core element of test validity is the construct, that is, the theoretical representation of the skill or knowledge that the test purports to measure (Slomp, 2005, p. 149). As such, the validity of a test rises or falls with the degree to which its scores reflect students’ ability in relation to the construct. Hughes (2003) defined a construct in language testing as “any underlying ability (or trait) that is hypothesized in a theory of language ability” (p. 31). Hence, in the case of reading comprehension, tests should reflect the theoretical assumptions under which reading teachers operate (Sequera, 1995). Whether the tests reflect the objectives for reading ability stipulated in the curriculum guidelines is therefore one of the issues the present study aims to address.

Testing Comprehension with Multiple-choice Questions

Many textbooks on language testing (e.g., Heaton, 1988; Hughes, 1989; Weir, 1990, 1993) have given examples of testing techniques that might be used to assess language abilities. Among these techniques, the multiple-choice (henceforth MC) format is a common device for testing students’ reading comprehension.

The MC format is considered an objective technique which “requires intellectual discrimination skills, a versatile test capable of probing a variety of areas and different types of cognitive activities such as acquisition of knowledge, understanding, application, analysis and evaluation” (Green, 1975; Marshall, 1971, cited in Nevo, 1989). The most obvious advantage of MC questions is probably that scoring can be reliable, rapid, and economical. In addition, the MC format allows more items to be included in a given period of time. A further advantage is that it allows the testing of receptive skills without requiring test-takers to produce output.

Nevertheless, despite the virtues of multiple-choice questions, the format has been criticized by researchers on a number of grounds. A serious disadvantage of MC questions is that guessing may have a considerable but unknowable effect on test scores (Hughes, 2003). Test-takers may get an item correct for the wrong reason, without actually understanding the text. Another objection to the use of multiple-choice questions is that they are often passage-independent, meaning that the items can be answered without reference to the reading passages accompanying them (Bernhardt, 1991; Teale & Rowley, 1984; Weir, 1997). Evidence has been found that test-takers do not always refer to the reading passages when answering MC questions (Cohen, 1998; Nevo, 1989), and it is suspected that explicit teaching of certain techniques will help students become test-wise and thus improve their scores (Richards, 1997). Nevo (1989, p. 212) argued that “it would appear useful to devote attention, time, and effort to guiding and training students in coping effectively with a test format like [the multiple-choice test]”.

A further concern with the MC format is that there has been much doubt about the validity of MC questions. Weir (1990) and Urquhart and Weir (1998) argued that answering multiple-choice questions is an unreal task, since in real-life communication one is rarely required to choose one answer from four options to show understanding. In a multiple-choice test, distractors can be used to trick test-takers by presenting choices that they may not otherwise have thought of (Alderson, 2000; Richards, 2000; Weir, 1990). Test-takers can be deliberately led into confusing dilemmas. It is therefore difficult to know whether failure on a question is due to lack of comprehension of the text or lack of comprehension of the question.

Despite the criticisms of the MC technique, multiple-choice questions have been widely used to assess reading ability, and issues have therefore been raised about what MC questions actually measure in reading tests and whether they are valid measures of reading ability (Cummings, 1982; Farr, Prichard & Smitten, 1990; Urquhart & Weir, 1998; Weir, 1997). Weir (1997) proposed a four-level model of reading comprehension for testing purposes: reading expeditiously for global comprehension, reading expeditiously for local comprehension, reading carefully for global comprehension, and reading carefully for local comprehension. In Weir’s model of reading operations, skills such as understanding the syntactic structure of sentences and clauses, understanding lexical and/or grammatical cohesion, understanding word meaning, and locating specific details are bottom-up skills that operate when reading at the local level, while identifying the main idea and making inferences are top-down skills that operate at the global level.

Similarly, Harrison (1983, cited in Nevo, 1989) argued that through the MC format it is possible to check all reading levels: the semantic and syntactic aspects of the text, the discourse level (cohesion and coherence connections among various parts of the text), and the pragmatic level (the writer’s point of view). In other words, both bottom-up and top-down reading skills can be tested via MC questions.

Studies of Reading Tests Analysis (Item Analysis)

As noted previously, researchers agree that tests can provide information about teaching and learning and can thus be helpful to both teachers and students. An analysis of reading comprehension test items also reveals the knowledge and skills valued by those tests. Research into reading tests has therefore received much interest in the field of reading research in Taiwan. Both the SAET and the DRET exert considerable influence on English teaching and learning in Taiwan, because the scores a student receives on these exams are a determining criterion for admission to desired universities. Given the importance of both tests, they have been frequent topics for discussion, and reports on the overall content and statistical properties of both tests have been produced annually by the College Entrance Examination Center (CEEC Web site; Huang, 1994; Jeng, 1992; Jiang & Lin, 1999; Xu & Lu, 1998). Most of these studies aimed at providing an overview of test construction or statistical results such as item difficulty, distractor performance, and passing rates, rather than a thorough qualitative analysis of the reading skills tested by each comprehension item (e.g., Huang, 1994; Jeng, 1992; Jiang & Lin, 1999; Xu & Lu, 1998). The statistical analyses have typically focused on the number and distribution of items, text length, vocabulary, topics, discriminatory power, and examinees’ test performance (e.g., passing rates).

Huang (1994) conducted a qualitative analysis of the Joint College Entrance Examination (henceforth JCEE, renamed as the DRET in 2002) English test items from 1985 to 1994. The reading comprehension item analysis showed that over 90% of the items were well designed, yet a few were poorly constructed or were not reading comprehension items at all. Items designed to test examinees’ vocabulary in context count as reading comprehension questions because readers have to look for contextual clues in the text. However, Huang found that some items designed to test examinees’ vocabulary were not well written because they could be answered without referring back to the text. Moreover, he found that some items were designed to test examinees’ grammar knowledge instead of their reading abilities. He indicated that items testing knowledge of vocabulary and grammar without requiring reference to the text are ill written and should thus be excluded. He also cautioned that care should be taken when constructing items that involve arithmetic.

Jeng (1992) conducted a statistical analysis of the 1991 JCEE English test, focusing on item difficulty, discriminatory power, and distractors. The test comprised three question formats: conversations, cloze, and reading comprehension. It was found that 76% of all the items in the test, a total of 34 items, were well written. Among the well-written items, 13 were reading comprehension items, amounting to 86% of the reading comprehension items in the test. The results also showed that the overall reliability of the test items reached .90.
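Indices of this kind are straightforward to compute. The following sketch works on a hypothetical 0/1 response matrix rather than Jeng’s actual data, and illustrates how item difficulty (facility), an upper-versus-lower-group discrimination index, and a KR-20 reliability coefficient of the sort reported above might be calculated; all variable names and figures in it are assumptions for illustration only.

import numpy as np

# Hypothetical dichotomous response matrix: 200 examinees x 45 items.
# Randomly generated placeholder data, not Jeng's (1992) test results.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(200, 45))

# Item difficulty (facility): proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

# Discrimination index: facility of the top 27% of examinees (ranked by total
# score) minus the facility of the bottom 27%.
totals = responses.sum(axis=1)
order = np.argsort(totals)
n_group = int(round(0.27 * len(totals)))
discrimination = (responses[order[-n_group:]].mean(axis=0)
                  - responses[order[:n_group]].mean(axis=0))

# KR-20 reliability for dichotomous items (the analogue of Cronbach's alpha).
k = responses.shape[1]
kr20 = (k / (k - 1)) * (1 - (difficulty * (1 - difficulty)).sum() / totals.var(ddof=1))

print(f"mean difficulty = {difficulty.mean():.2f}, "
      f"mean discrimination = {discrimination.mean():.2f}, KR-20 = {kr20:.2f}")

In an actual analysis these statistics would of course be computed from the examinees’ real answer sheets, item by item.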

Xu and Lu (1998) studied various elements of the content of the 1998 JCEE English test, including topics, text length, syntactic complexity, vocabulary, distractors, and question types. The researchers found that the reading comprehension items could generally be classified into four types: vocabulary, main idea, detail, and inference. However, they did not further identify these elements item by item or examine the frequency and distribution of the different item types. In short, the aforementioned studies aimed at describing overall test construction or statistical results rather than providing a thorough qualitative analysis of the reading skills tested by each comprehension item.

Recently, two studies (Hsu, 2005; Lu, 2002) used Mo’s taxonomy as the coding scheme to analyze the reading comprehension test items on the SAET and DRET. Lu (2002) conducted both qualitative and quantitative analyses of the reading skills measured in the reading comprehension section of the SAET from 1995 to 2002.

Qualitatively, Lu categorized the reading comprehension test items using Mo’s classification as the coding scheme. Lu also examined the texts and the variables that affected the passing rates of high achievers and low achievers. For the quantitative analysis, she computed the frequency distribution of question types and the correlation between question types and passing rates. Results showed that the most frequent question type was the detail item, followed by items on inference, main idea, writer’s style/tone, organization, and word meaning. Lu further categorized the detail items into seven types: (1) specific-answer questions, (2) identifying true/false statements, (3) cause-effect, (4) number/date, (5) contrast, (6) sequence-of-events, and (7) following-directions. Lu’s findings also revealed that, in general, examinees performed best on items that tested word meaning, followed by items on main idea, details, and inference, while they performed worst on items that tested organization and style/tone. With respect to the performance of high and low achievers, high achievers performed best on items that tested details, followed by items on inference, word meaning, and main idea, whereas low achievers on average performed fairly well on items that tested word meaning and main idea but poorly on items that tested details and inference. Neither group performed well on items that tested the writer’s style/tone. Both high and low achievers had difficulty in several areas, such as answering items that demand higher-order reading processes, answering items that require interpretation and inference, synthesizing numerous details to answer an item correctly, recognizing textual features, and understanding lengthy articles on unfamiliar topics.

However, Lu’s study exhibited several weaknesses. First, the analysis is subject to the researcher’s own interpretations; it would have been better if another rater had been recruited to analyze the test items so that inter-rater consistency could be checked. Second, the analysis of detail and inference items is confusing, because Lu placed some items that tested inference skills into different categories, some as detail items and others as items on inference of details. Thus, Lu’s method of categorization may be flawed.
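The first weakness could be addressed with a simple agreement check. The sketch below uses invented category labels from Mo’s taxonomy and made-up codings rather than Lu’s data, and shows how Cohen’s kappa could be used to gauge consistency between two raters who classify the same items.

from collections import Counter

# Hypothetical codings of twelve reading items by two raters; the labels and
# assignments are invented for illustration, not taken from Lu (2002).
rater_a = ["detail", "detail", "inference", "main idea", "word meaning", "detail",
           "inference", "detail", "main idea", "style/tone", "detail", "inference"]
rater_b = ["detail", "inference", "inference", "main idea", "word meaning", "detail",
           "detail", "detail", "main idea", "style/tone", "detail", "inference"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement

# Chance agreement estimated from each rater's marginal category proportions.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(rater_a) | set(rater_b))

kappa = (observed - expected) / (1 - expected)  # Cohen's kappa
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")

A kappa value close to 1 would indicate that the categorization can be reproduced by an independent rater, whereas a low value would signal the kind of inconsistency noted above.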

Hsu (2005), applying the same coding scheme, analyzed the reading comprehension test items on the 2000 and 2001 JCEE English tests and the DRET from 2002 to 2004. The themes of the texts, the text variables that accounted for item difficulty, the examinees’ passing rates on each question type, and the discrimination index of the test items were also investigated. Moreover, the Word List published in 1996 by the CEEC was used to analyze the words in the chosen texts and identify which words were beyond its scope. To examine the passing rates on each question type, instead of using the passing-rate statistics provided by the CEEC, Hsu recruited 76 second-year students from two classes in a high school in Kaohsiung City (a high-proficiency group of 20 students, a middle-proficiency group of 36 students, and a low-proficiency group of 20 students) to answer questions based on eighteen passages from the 2001 JCEE and the 2002-2004 DRET, and used their performance for data analysis. Similar to Lu’s (2002) study, it was found that the most frequently used question type was the detail item. Likewise, examinees performed well on items that tested lower-level skills such as determining the meaning of words and finding specific details, whereas they performed worst on inference questions, which require higher-level processing. As for the passing rates of high and low achievers, among the four most frequent question types (word meaning, detail, main idea, and inference), the high-proficiency group performed best on items that tested details and word meaning yet worst on items that tested the main idea. The low-proficiency group performed best on word meaning items yet worst when required to draw inferences and identify the main idea of a text. Both groups performed well on detail items.

However, several drawbacks can be found in Hsu’s study. First, the study would have been more convincing if the author had computed the passing rates of the examinees who actually took those tests rather than recruiting students who had not taken them as participants; had the actual examinees’ passing rates been used, the results might have turned out differently. Second, the categorization of some test items is problematic: items that measure test-takers’ knowledge of discourse were categorized as word meaning items or detail items. Take the item analysis of the 2002 DRET for example: Q56 was categorized by Hsu as a word meaning item, while it in fact tested discourse knowledge of the referent “he,” since an examinee is required to look for contextual clues in order to assign meaning to that referent. Q51 on the 2004 DRET, which also tested discourse knowledge of referents, was classified as a detail item in Hsu’s study. It thus appears that items testing discourse knowledge were consistently miscategorized as word meaning or detail items, and the results of Hsu’s study may therefore be problematic.
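The kind of group breakdown reported by Hsu can be tabulated directly once each response is recorded with its proficiency group and question type. The sketch below does so for a handful of invented records; the group labels and question types mirror those mentioned above, but the scores are assumptions for illustration, not Hsu’s (2005) data.

from collections import defaultdict

# Each record is (proficiency group, question type, 1 if answered correctly else 0).
records = [
    ("high", "detail", 1), ("high", "detail", 1), ("high", "word meaning", 1),
    ("high", "main idea", 0), ("high", "inference", 1),
    ("low", "detail", 1), ("low", "detail", 0), ("low", "word meaning", 1),
    ("low", "main idea", 0), ("low", "inference", 0),
]

# (group, question type) -> [number correct, number answered]
tallies = defaultdict(lambda: [0, 0])
for group, qtype, correct in records:
    tallies[(group, qtype)][0] += correct
    tallies[(group, qtype)][1] += 1

for (group, qtype), (right, answered) in sorted(tallies.items()):
    print(f"{group:>4} | {qtype:<12} | passing rate = {right / answered:.2f}")

Computed over the full set of examinee responses, such a table would give the actual passing rate per question type for each proficiency group.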

A more recent study by Lan (2007) analyzed the reading comprehension question types on the SAET and DRET both qualitatively and quantitatively by adopting the revised Bloom’s taxonomy, aiming to investigate what cognitive process levels and knowledge types were tested on both exams from 2002 to 2006. The revised Bloom’s taxonomy formulated by Anderson and Krathwohl (2001) is divided into two dimensions: the knowledge dimension and the cognitive process dimension. The knowledge dimension contains four major types of knowledge, each with subtypes: factual knowledge, conceptual knowledge, procedural knowledge, and metacognitive knowledge. The cognitive process dimension includes six categories, each with subcategories: Remember, Understand, Apply, Analyze, Evaluate, and Create.

The results of Lan’s study showed that the item classification of the 2002 to 2006 SAET and DRET yielded four major cognitive process levels with eight sub-levels, and three types of knowledge with four subtypes. The four main cognitive processes identified were Remember, Understand, Apply, and Analyze. Five major combinations of cognitive levels and types of knowledge were found: (1) Remember Factual Knowledge, (2) Understand Factual Knowledge, (3) Understand Conceptual Knowledge, (4) Apply Procedural Knowledge, and (5) Analyze Conceptual Knowledge. Nine sub-combinations were identified as well: (1) Recognizing specific details and elements, (2) Interpreting specific details and elements, (3) Inferring specific details and elements, (4) Classifying into classifications and categories, (5) Summarizing principles and generalizations, (6) Inferring classifications and categories, (7) Explaining principles and generalizations, (8) Executing subject-specific skills and algorithms, and (9) Attributing principles and generalizations.

Among the question types that emerged, items testing Remember Factual Knowledge and Understand Factual Knowledge were the most frequent throughout the five years, accounting for 74.1% of the items on the SAET and 73% on the DRET. Items at the Evaluate and Create levels were not found.

The item frequencies in the SAET showed that around half of the items aimed to test students’ ability to recognize facts in the passages, and almost one third measured the ability to understand specific details. Similar to the SAET, the DRET also focused on testing students’ abilities to identify and understand facts. In terms of frequency, both tests showed a similar pattern, since most of the items targeted factual knowledge at the Remember and Understand levels.
