Note: Answers are bolded and appear as the first option

28

W: I'm going to buy some new shoes. These hurt my toes.

M: You should buy wider shoes then. They're much better for your feet.

W: Yes, these shoes are too narrow.

Q: Which statement is most likely true?

The woman does not like how her shoes fit.

The man thinks new shoes would look better.

The woman would rather wear narrow shoes.

The man wants to buy wider shoes for the woman.

The present researcher conjectures that the undergraduate subjects are still

relatively inexperienced in personal relations situated in the workplace or with the

opposite sex, and therefore tend to misunderstand the pragmatic or implied meanings

of the utterances in the stems and options. For example, stimulus-response items

five, six and 15 may present scenarios taken for granted as typical daily situations by

the panel of test reviewers, but which may be actually quite novel to the test-taking

population. Experienced adults may have no problem identifying perfume as an

appropriate gift option for “her”, an unspecified female third person, but some male

undergraduates in Taiwan are less precocious than their international counterparts,

whose responses may have informed the creation of the item stem, and so cannot see a

suitable response in the option array. Also, item 15 is certainly susceptible to gender

variance in that females are more concerned than males about strong sunlight on the

face and exposed skin. In this case, getting a hat before going out makes perfect

sense to one group of test-takers, while a small minority may be completely clueless

as to the implications of being “ready to go”.

The short conversation items provide more examples of typical standardized test

questions which present relatively unfamiliar situations to the testees. Items two,

nine and 19 are noteworthy because of the high gender DIF. Item two centers on the

dilemma of a gift for a third person, the same general scenario in stimulus-response

item five, except that it is extended into a conversation. This item has a much

sharper discrimination power for males than for females, indicating that it assesses

males’ experience with buying a gift for ‘someone who has it all’ more than listening

ability. Item nine, the conversation about bus fare, presents a scenario from another

culture (and, arguably, a different era of history), as commuters in Taiwan nowadays

typically pay with IC-embedded debit cards. Finally, item 19 centers on a discussion

about clothing import/export, a workplace situation which is only experienced

vicariously by testees through other commercially available English learning material,

if at all.

The object lesson learned during this test piloting is that even the best of expert

reviews still falls considerably short of the informative power of hard, empirical

results, a conclusion that echoes the advice of Ross and Okabe (2006), who also

witnessed a wide disparity between subjective “expert” reviews and objective

follow-up DIF analyses. In the present case, the content review panel consisted of

middle-aged Taiwanese and native-English speaking professors, all of whom were

married and had lived in Taiwan for many years; hence, the anomalous scenarios fell

within the bounds of the reviewers’ personal experience and were overlooked. The

source recordings came from practice material commercially available in Taiwan, so

the scenarios were likely to be regarded as common with respect to typical

test content. Therefore, as far as the expert reviewers were concerned, there was no

language obviously biased against any gender group, nor was there any scenario

completely alien to the prospective test-takers. Also, as every item had a single,

unambiguously correct option, the reviewers readily endorsed the item pools. The

relative lack of uniform DIF attests to the fact that, overall, the respective groups have

equivalent probabilities of answering correctly, so in this respect, the judgment of the

reviewers seems reasonable. Yet, the non-uniform DIF, the interaction of group

membership and items, paints a different picture, and this interaction remains opaque

to the experts.

For the minority male group, these items tend to sharply discriminate between

the men who have various social relations outside of academic settings and those

without. In contrast, the majority of females at the undergraduate age do not suffer

from such a dearth of social experience and, thus, there appears to be a gender-related

dimension confounding EFL listening ability as measured by several of these test

questions. Therefore, empirical gender DIF was manifest after the expert review, as it

is very difficult for an experienced instructor to adequately review a test from the

collective perspective of all the prospective testees.

In conclusion, a number of piloted items exhibit non-uniform DIF effects which

indicate interaction between gender and the construct of EFL listening ability.

Furthermore, inclusion of the complete item pools would create a lengthy listening

exam of 63 items, which would be highly susceptible to fatigue and test effects; hence,

it was decided that removal of the suspect items is the most advantageous method for

maintaining integrity of the listening ability measurements.

Construct Validity Review

Although item analysis of the pilot test questions was informative regarding the

empirical characteristics such as difficulty, discrimination, item fits and DIF, the

precise construct of listening ability had yet to be defined, a requisite step before

analysis of the data collected during the main study.

To establish the construct validity of the testing instrument, the items were

compiled into a questionnaire and sent to seven (3 male, 4 female) experienced EFL

instructors in Taiwan, independent of the instructors who performed the initial review

of the item pools. Two panelists have taught EFL/ESL for more than 20 years, two

have 11-15 years of experience, two have six to 10 years of experience, and the

youngest panelist has one to five years of experience. Two panelists are native

English speakers and five panelists are non-native. All panelists are currently

teaching EFL in Taiwan.

The panelists were provided with a taxonomy of 10 listening skills based on

Brindley (1997), Eom (2008) and Rost (2005) and instructed to identify every

listening skill that was tested by each item (multiple answer). Next, the panelists

were asked to identify the first and second most essential listening skills for

responding correctly to the items (multiple choice), and finally the panelists were

asked to consider any content issues which could render the items problematic

(open-ended response).

The first multiple answer question was intended to spark earnest introspection in

the panelists such that they would consider how each test item reflects the repertoire

of listening skills entailed by the Rost (2005) listening process, composed of decoding,

comprehension, interpretation and response stages. The skills identified in the

panelists’ responses comprise the range of requisite listening skills tapped by each

item. The questions to identify the primary and secondary listening skills require

panelists to make judgments on the prime factor affecting the probability of a

correct item response. In other words, these questions require the panelists to define the

theta construct of listening ability for each item. Based on research described in

Brindley (1997) and Eom (2008), it was not anticipated that panelists would agree on

which skills are primary and secondary; more often than not, experts do not concur on

the underlying construct of test items. Although individual panelists may agree that

a given skill is an important factor, substantial variability was assumed in

their judgments about which skill ranks first and second, resulting in frequently

tied rankings. To counteract this effect of the small sample size, the responses to

these two questions were aggregated and tallied akin to collapsing categories of Likert

responses. The skill endorsed most frequently across both categories was taken as

the primary factor reflected by the item and tagged as the single, “essential” listening

skill. The construct validity questionnaire displaying the taxonomy of 10 listening

skills provided for the panelists is attached as Appendix 4.

Results. Overall, it can be stated that the listening exam instrument tests the

higher processing “interpretation” skills based on unweighted frequencies of

endorsement (requisite skills) and weighted frequencies of endorsement (essential

skills). Table 22 summarizes the construct of listening skills at the test level.

Practically speaking, the item-level constructs are of principal importance, as the

items with clearly defined, uniform constructs will be compiled into a finalized

instrument for measurement of listening ability, the criterion variable in the final SEM

analysis. In this respect, the largest share of the items (16) are presumed to test skill nine,

“interpreting key information” followed by eight items reflecting skill eight,

“interpreting the social/situational context” and four items reflecting skill seven,

“interpreting the illocutionary intent”.

Table 22.

Constructs of Listening Exam Items by Listening Skills

Listening                                                    Endorsement as:
Skill      Skill Description                                 Requisite  Essential  Items reflecting skill
1          Decoding verbs, including, but not limited to:    63%        10%        N/A
           tenses and prepositional verbs
2          Decoding key vocabulary, including, but not       67%        33%        9, 11, 34, 37
           limited to: discourse and cohesive markers
3          Decoding parts of speech/morphosyntax             54%        4%         N/A
4          Decoding auxiliary negatives                      26%        3%         N/A
5          Comprehending idiomatic expressions               28%        8%         12, 38
6          Comprehending details of utterances               74%        30%        2, 15, 29, 32, 33, 36
7          Interpreting the illocutionary intent behind      56%        24%        8, 10, 14, 18
           utterances
8          Interpreting the social/situational context of    73%        28%        1, 16, 17, 21, 29, 30, 34, 35
           utterances
9          Interpreting key information in utterances        85%        44%        3, 4, 5, 6, 7, 13, 19, 20, 22, 23, 24, 26, 27, 28, 39, 40
10         Interpreting the attitude/emotional state of a    50%        14%        25, 31, 36
           speaker

Note. Percent rate of endorsement is proportion of total possible token endorsements, 280 per skill (7 panelists x 40 items).

Appendix 6 provides individual item histograms of requisite and essential skill

endorsement. Despite the aggregation of “first most essential” and “second most

essential” responses, a number of test items still exhibit tied rankings of the primary

underlying listening construct. When tied rankings of essential skills occurred, the

frequency count of requisite skills was used for arbitration under the assumption that a

listening skill could not be the primary underlying construct if it was not recognized

as extant by the majority of panelists. This method of arbitration resolved most tied

rankings; however, three test items (29, 34 and 36) still reflected more than one

listening skill as the presumed underlying construct. Analysis of the test responses

in the main study necessarily excluded these items, as there was no consensus on the

construct being tested.
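
The tallying and arbitration procedure described above can be sketched as follows. This is a minimal illustration only, not the researcher's actual scoring script: the function name `essential_skill`, the variable names, and the panelist responses are all hypothetical, and the example assumes each of the seven panelists names exactly one "first" and one "second" most essential skill per item.

```python
from collections import Counter

def essential_skill(first_choices, second_choices, requisite_counts):
    """Return the skill(s) tagged as essential for a single item.

    first_choices / second_choices: one skill number per panelist.
    requisite_counts: skill number -> number of panelists endorsing the
    skill as requisite (used only to arbitrate tied rankings).
    """
    # Aggregate "first" and "second" most essential responses into one tally.
    tally = Counter(first_choices) + Counter(second_choices)
    top = max(tally.values())
    tied = [skill for skill, count in tally.items() if count == top]

    # Arbitrate ties with the requisite-skill frequency count.
    if len(tied) > 1:
        best = max(requisite_counts.get(skill, 0) for skill in tied)
        tied = [s for s in tied if requisite_counts.get(s, 0) == best]

    # More than one skill left means no consensus on the construct.
    return sorted(tied)

# Hypothetical responses from seven panelists for one item.
first_choices = [9, 9, 9, 8, 8, 8, 6]
second_choices = [8, 8, 8, 9, 9, 9, 6]
requisite_counts = {6: 5, 8: 5, 9: 7}

result = essential_skill(first_choices, second_choices, requisite_counts)
print(result)
```

In this made-up case skills eight and nine tie on essential endorsements, and the requisite count settles the tie in favor of skill nine. An item for which the function still returns more than one skill corresponds to the unresolved cases that were excluded from the main-study analysis.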

Implications for Main Study

This section discusses the necessity of modifications to the questionnaires and

listening test. A small number of minor modifications to the questionnaires are

proposed and descriptions of data analysis procedures to be adopted in the main study

are provided.

Questionnaire Instruments. The reader may recall that all of the self-assessed

latent trait measurement instruments potentially exhibit DIF effects. The DIF

analyses at this point are considered tentative, but point to non-uniform DIF, the

interaction of group membership on the metric of the item scales, as the primary risk

to measurement invariance, the evidence of which is the noticeably disparate

discrimination parameters between the groups. As females were the majority,

reference group, the DIF effects manifest themselves as negative ΔR² values, i.e., the

majority of respondents yield relatively shallow sloped ICCs while the focal group

yields steeply sloped ICCs. Thus, assuming the presence of gender DIF, suboptimal

measurement is primarily manifest as a low contrast between high and low theta

individuals in the majority group. In practical terms, the items tend to measure the

traits better among the male minority, as the higher discriminations (loadings)

translate into greater variance explained among male respondents.
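
The pattern of shallow versus steep ICCs can be illustrated with a short sketch. It is a simplification under stated assumptions: a dichotomous two-parameter logistic curve stands in for the polytomous questionnaire items, and the discrimination and location values are hypothetical rather than estimates from the pilot data.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters: same location, different discriminations.
A_REFERENCE = 0.6   # shallow slope (majority, reference group)
A_FOCAL = 1.8       # steep slope (minority, focal group)
B_COMMON = 0.0

def group_gap(theta):
    """Focal-minus-reference response probability at a given trait level."""
    return icc(theta, A_FOCAL, B_COMMON) - icc(theta, A_REFERENCE, B_COMMON)

# The sign of the gap reverses along the trait continuum.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  gap={group_gap(theta):+.3f}")
```

Because the two curves cross, the group difference changes sign along the trait continuum, which is the defining feature of non-uniform DIF. Under uniform DIF one group would hold the advantage at every trait level.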

A common DIF remedy in testing research is revision of the affected items or

elimination from final versions of the instrumentation. Unfortunately, invocation of

these remedies is premature, as there are no indications within the wording of the

questionnaire items where gender influences endorsement, and loss of items would

lead to loss of information. Instead, it is proposed to retain the complete sets of

items to continue collecting observations in the main study sample, and check gender

invariance again.

A further consideration with respect to the latent trait questionnaire instruments

is the number of categories in the item response scales. It was earlier described how

the ISCEL pilot used a reduced, four-point scale to economize parameters in the final

SR model. Concurrent with the ISCEL pilot, four-point scales were trialed with the

BELLA and ELLSI items, as well. These results are detailed in Appendix 7 for

readers who maintain interest in this level of detail, as these findings are tangential to

the main research described herein. Suffice it to say, the four-point scale is untenable,

as it reduces the resolution of trait-space considerably, leading to configural variance

in the BELLA scale and loss of dimensionality in the ELLSI scale. Thus, it was

decided that the final questionnaire instrument utilize five-point scales of frequency

and agreement for the three self-assessed latent trait inventories.

Listening Test. Piloting of the item pools for the stimulus-response and short

conversation items found that numerous items exhibited non-uniform DIF effects,

with minimal uniform DIF distortion. However, unlike the questionnaire

instruments, there are more than sufficient numbers of items remaining in the item

pools to construct a compact examination with reduced DIF effects. Moreover,

review of the suspect items can readily identify characteristics of poor test questions,

making item deletion the optimal means of mitigating DIF effects. With these

considerations in mind, the listening exam was trimmed to 40 items based on the

presence of DIF effects and content review, and the revised exam underwent further

rounds of expert review to establish construct validity, i.e., specification and

identification of the listening skills measured by the exam, as previously described.

Another minor change, to the layout of the test, was made in response to

one construct reviewer’s perception of a dominant “travel/vehicle” theme to the test

questions. Several items pertaining to the context of travel, or vehicles in general,

were placed proximally, leading to a subjective impression of bias. The reader may

note short conversation items 23, 29, 34, 36, 37, 39 and 40 in Appendix 4. The

layout of the test during principal data collection repositioned these items to

intersperse them more evenly. Discussion of the listening test analysis in Chapters

Four and Five reflects this altered item order.

Additionally, a notable caveat arising from the use of Mplus

modeling software to run 1PL item analysis is the dearth of item fit indices. The

item χ² tests of residual error for dichotomous items yielded unusually optimistic

results in the piloting analyses. These over-estimations of item fit are due to the lack

of trait-ability groupings in the χ² contingency table. Univariate χ² tests of item

residuals in Mplus merely classify respondents into a 2 x 2 table with 1 degree of

freedom, i.e., the overall expected proportion of success versus observed successes

per item. In contrast, the mean square residual outlier-sensitive fit (outfit; Wright &

Masters, 1990), derived from the Q1 χ² statistic (Yen, 1981), utilizes multiple

categories of ability groupings along the trait continuum, thus providing greater

precision in identification of un-modeled noise. When analyzing the test responses

in the main study, the researcher used the mean square residual outfit and

z-standardized outfit indices (a.k.a. t-statistics) in lieu of univariate standardized

residuals and item χ² supplied by Mplus.
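
A grouped-residual fit statistic of this kind can be sketched as follows. The sketch rests on several assumptions: the standardized residual for each ability group is taken as the difference between observed and model-expected proportions correct divided by its binomial standard error, the mean square is Q1 divided by the number of groups, the Wilson-Hilferty cube-root approximation is applied with m - 1 degrees of freedom, and the proportions and group sizes are hypothetical. It does not reproduce the output of any particular software.

```python
import math

def q1_statistic(observed, expected, sizes):
    """Sum of squared standardized residuals across ability groups."""
    q1 = 0.0
    for o, e, n in zip(observed, expected, sizes):
        # Residual of the observed proportion correct, standardized by
        # the binomial standard error under the model.
        z = (o - e) / math.sqrt(e * (1.0 - e) / n)
        q1 += z * z
    return q1

def wilson_hilferty_z(chi_sq, df):
    """Cube-root approximation mapping a chi-square value to a z-score."""
    var = 2.0 / (9.0 * df)
    return ((chi_sq / df) ** (1.0 / 3.0) - (1.0 - var)) / math.sqrt(var)

# Hypothetical proportions correct for one item in four ability groups.
observed = [0.20, 0.45, 0.70, 0.90]
expected = [0.25, 0.50, 0.70, 0.85]
sizes = [50, 50, 50, 50]

m = len(sizes)
q1 = q1_statistic(observed, expected, sizes)
mean_square = q1 / m                       # Q1 divided by number of groups
z_std = wilson_hilferty_z(q1, df=m - 1)    # 1PL: df = m - 1

print(f"Q1={q1:.3f}  mean square={mean_square:.3f}  z={z_std:.3f}")
```

Splitting respondents into several ability groups lets misfit that is concentrated at one end of the trait continuum contribute to the statistic, whereas a single overall comparison of expected and observed successes averages it away.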

For the reader’s reference, the Q1 index of item fit is the sum of the squared

standardized residuals, expressed as:

\[ Q_{1} = \sum_{j=1}^{m} z_{ij}^{2} \]

where m is the number of trait-ability groups, and $z_{ij}^{2}$ is the squared standardized

residual for item i at ability category j. The Q1 follows a χ² distribution with

degrees of freedom equal to the number of ability groups in the analysis less the

parameterization of the model. In the case of the 1PL model, the degrees of freedom

are m − 1. This Q1 item statistic divided by the number of observations, i.e., ability

groupings, yields the mean square residual outfit. The χ² distribution can then be

transformed by means of the Wilson-Hilferty transformation into a unit-normal

distribution with infinite degrees of freedom. This is commonly known as the

z-distribution with a critical value of ±1.96; however, in IRT parlance this test of