Note: Answers are bolded and appear as the first option
28
W: I'm going to buy some new shoes. These hurt my toes.
M: You should buy wider shoes then. They're much better for your feet.
W: Yes, these shoes are too narrow.
Q: Which statement is most likely true?
The woman does not like how her shoes fit.
The man thinks new shoes would look better.
The woman would rather wear narrow shoes.
The man wants to buy wider shoes for the woman.
The present researcher conjectures that the undergraduate subjects are still
relatively inexperienced in personal relations situated in the workplace or with the
opposite sex, and therefore tend to misunderstand the pragmatic or implied meanings
of the utterances in the stems and options. For example, stimulus-response items
five, six and 15 may present scenarios taken for granted as typical daily situations by
the panel of test reviewers, but which may be actually quite novel to the test-taking
population. Experienced adults may have no problem identifying perfume as an
appropriate gift option for “her”, an unspecified female third person, but some male
undergraduates in Taiwan are less precocious than their international counterparts, whose
responses may have informed the creation of the item stem, and thus may not see a
suitable response in the option array. Also, item 15 is certainly susceptible to gender
variance in that females are more concerned than males about strong sunlight on the
face and exposed skin. In this case, getting a hat before going out makes perfect
sense to one group of test-takers, while a small minority may be completely clueless
as to the implications of being “ready to go”.
The short conversation items provide more examples of typical standardized test
questions which present relatively unfamiliar situations to the testees. Items two,
nine and 19 are noteworthy because of the high gender DIF. Item two centers on the
dilemma of a gift for a third person, the same general scenario in stimulus-response
item five, except that it is extended into a conversation. This item has a much
sharper discrimination power for males than for females, indicating that it assesses
males’ experience with buying a gift for ‘someone who has it all’ more than listening
ability. Item nine, the conversation about bus fare, presents a scenario from another
culture (and, arguably, a different era), as commuters in Taiwan nowadays
typically pay with IC-embedded debit cards. Finally, item 19 centers on a discussion
about clothing import/export, a workplace situation which is only experienced
vicariously by testees through other commercially available English learning material,
if at all.
The object lesson learned during this test piloting is that even the best of expert
reviews still falls considerably short of the informative power of hard, empirical
results, a conclusion that echoes the advice of Ross and Okabe (2006), who also
observed a wide disparity between subjective “expert” reviews and objective
follow-up DIF analyses. In the present case, the content review panel consisted of
middle-aged Taiwanese and native-English-speaking professors, all of whom were
married and had lived in Taiwan for many years. Hence, the anomalous scenarios fell
within the bounds of the reviewers’ personal experience and were overlooked. The
source recordings came from practice material commercially available in Taiwan, so
the scenarios were very likely to be regarded as common with respect to typical
test content. Therefore, as far as the expert reviewers were concerned, there was no
language obviously biased against any gender group, nor was there any scenario
completely alien to the prospective test-takers. Also, as every item had a single,
unambiguously correct option, the reviewers readily endorsed the item pools. The
relative lack of uniform DIF indicates that, overall, the respective groups have
equivalent probabilities of answering correctly, so in this respect, the judgment of the
reviewers seems reasonable. Yet, the non-uniform DIF, the interaction of group
membership and items, paints a different picture, and this interaction remains opaque
to the experts.
For the minority male group, these items tend to sharply discriminate between
the men who have various social relations outside of academic settings and those
without. In contrast, the majority of females at the undergraduate age do not suffer
from such a dearth of social experience and, thus, there appears to be a gender-related
dimension confounding EFL listening ability as measured by several of these test
questions. Therefore, empirical gender DIF was manifest post-expert-review, as it
is very difficult for an experienced instructor to adequately review a test from the
collective perspective of all the prospective testees.
In conclusion, a number of piloted items exhibit non-uniform DIF effects which
indicate interaction between gender and the construct of EFL listening ability.
Furthermore, inclusion of the complete item pools would create a lengthy listening
exam of 63 items, which would be highly susceptible to fatigue and test effects. Hence,
it was decided that removal of the suspect items was the most advantageous method for
maintaining the integrity of the listening ability measurements.
Construct Validity Review
Although item analysis of the pilot test questions was informative regarding the
empirical characteristics such as difficulty, discrimination, item fits and DIF, the
precise construct of listening ability had yet to be defined, a requisite step before
analysis of the data collected during the main study.
To establish the construct validity of the testing instrument, the items were
compiled into a questionnaire and sent to seven (3 male, 4 female) experienced EFL
instructors in Taiwan, independent of the instructors who performed the initial review
of the item pools. Two panelists have taught EFL/ESL for more than 20 years, two
have 11-15 years of experience, two have six to 10 years of experience, and the
youngest panelist has one to five years of experience. Two panelists are native
English speakers and five panelists are non-native. All panelists are currently
teaching EFL in Taiwan.
The panelists were provided with a taxonomy of 10 listening skills based on
Brindley (1997), Eom (2008) and Rost (2005) and instructed to identify every
listening skill that was tested by each item (multiple answer). Next, the panelists
were asked to identify the first and second most essential listening skills for
responding correctly to the items (multiple choice), and finally the panelists were
asked to consider any content issues which could render the items problematic
(open-ended response).
The first multiple answer question was intended to spark earnest introspection in
the panelists such that they would consider how each test item reflects the repertoire
of listening skills entailed by the Rost (2005) listening process, which comprises decoding,
comprehension, interpretation and response stages. The skills identified in the
panelists’ responses comprise the range of requisite listening skills tapped by each
item. The questions to identify the primary and secondary listening skills require
panelists to make judgments on the prime factor affecting the probability of a correct
response to the item. In other words, these questions require the panelists to define the
theta construct of listening ability for each item. Based on research described in
Brindley (1997) and Eom (2008), it was not anticipated that panelists would agree on
which skills are primary and secondary; more often than not, experts do not concur on
the underlying construct of test items. Although respective panelists may agree that
a common skill is an important factor, substantial variability was assumed in
their judgments about which skill is ranked first and second, resulting in frequently
tied rankings. To counteract this effect of the small sample size, the responses to
these two questions were aggregated and tallied, akin to collapsing categories of Likert
responses. The skill endorsed most frequently across both categories was taken as
the primary factor reflected by the item and tagged as the single, “essential” listening
skill. The construct validity questionnaire displaying the taxonomy of 10 listening
skills provided for the panelists is attached as Appendix 4.
Results. Overall, the listening exam instrument mainly tests the
higher-order processing “interpretation” skills, based on unweighted frequencies of
endorsement (requisite skills) and weighted frequencies of endorsement (essential
skills). Table 22 summarizes the construct of listening skills at the test level.
Practically speaking, the item-level constructs are of principal importance, as the
items with clearly defined, uniform constructs will be compiled into a finalized
instrument for measurement of listening ability, the criterion variable in the final SEM
analysis. In this respect, the majority of the items (16) are presumed to test skill nine,
“interpreting key information” followed by eight items reflecting skill eight,
“interpreting the social/situational context” and four items reflecting skill seven,
“interpreting the illocutionary intent”.
Table 22.
Constructs of Listening Exam Items by Listening Skills

                                                          Endorsement as:
Skill  Skill description                                  Requisite  Essential  Items reflecting skill
1      Decoding verbs, including, but not limited to:     63%        10%        N/A
       tenses and prepositional verbs
2      Decoding key vocabulary, including, but not        67%        33%        9, 11, 34, 37
       limited to: discourse and cohesive markers
3      Decoding parts of speech/morphosyntax              54%        4%         N/A
4      Decoding auxiliary negatives                       26%        3%         N/A
5      Comprehending idiomatic expressions                28%        8%         12, 38
6      Comprehending details of utterances                74%        30%        2, 15, 29, 32, 33, 36
7      Interpreting the illocutionary intent behind       56%        24%        8, 10, 14, 18
       utterances
8      Interpreting the social/situational context of     73%        28%        1, 16, 17, 21, 29, 30, 34, 35
       utterances
9      Interpreting key information in utterances         85%        44%        3, 4, 5, 6, 7, 13, 19, 20, 22, 23, 24, 26, 27, 28, 39, 40
10     Interpreting the attitude/emotional state of a     50%        14%        25, 31, 36
       speaker
Note. Percent rate of endorsement is proportion of total possible token endorsements, 280 per skill (7 panelists x 40 items).
Appendix 6 provides individual item histograms of requisite and essential skill
endorsement. Despite the aggregation of “first most essential” and “second most
essential” responses, a number of test items still exhibit tied rankings of the primary
underlying listening construct. When tied rankings of essential skills occurred, the
frequency count of requisite skills was used for arbitration under the assumption that a
listening skill could not be the primary underlying construct if it was not recognized
as extant by the majority of panelists. This method of arbitration resolved most tied
rankings; however, three test items (29, 34 and 36) still reflected more than one
listening skill as the presumed underlying construct. Analysis of the test responses
in the main study necessarily excluded these items, as there was no consensus on the
construct being tested.
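The tallying and arbitration procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure as actually implemented in the study: the function name `essential_skill`, the unweighted pooling of first and second choices, the rule that the tied skill with the highest requisite count wins, and the vote data are all assumptions made for the example.

```python
from collections import Counter


def essential_skill(first, second, requisite):
    """Return the skill(s) taken as essential for one item.

    first / second: each panelist's "first most essential" and
    "second most essential" skill number for the item.
    requisite: skill number -> count of panelists marking it requisite.
    A one-element result means the ranking was resolved.
    """
    # Pool both rankings into a single tally (unweighted here, by assumption).
    tally = Counter(first) + Counter(second)
    top = max(tally.values())
    tied = sorted(skill for skill, n in tally.items() if n == top)
    if len(tied) == 1:
        return tied
    # Arbitration (assumed rule): among tied skills, prefer the one
    # endorsed as requisite by the most panelists.
    best = max(requisite.get(skill, 0) for skill in tied)
    return [skill for skill in tied if requisite.get(skill, 0) == best]


# Hypothetical votes from seven panelists for a single item.
first = [9, 9, 9, 8, 8, 8, 6]
second = [8, 8, 8, 9, 9, 9, 6]

print(essential_skill(first, second, {6: 4, 8: 5, 9: 7}))  # -> [9]
print(essential_skill(first, second, {6: 4, 8: 6, 9: 6}))  # -> [8, 9]
```

In the second call the requisite counts are also tied, so two skills remain; this corresponds to the situation in which an item has no consensus construct and is excluded from the analysis.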
Implications for Main Study
This section discusses the necessity of modifications to the questionnaires and
listening test. A small number of minor modifications to the questionnaires are
proposed and descriptions of data analysis procedures to be adopted in the main study
are provided.
Questionnaire Instruments. The reader may recall that all of the self-assessed
latent trait measurement instruments potentially exhibit DIF effects. The DIF
analyses at this point are considered tentative, but point to non-uniform DIF, the
interaction of group membership on the metric of the item scales, as the primary risk
to measurement invariance, the evidence of which is the noticeably disparate
discrimination parameters between the groups. As females were the majority
reference group, the DIF effects manifest themselves as negative ΔR² values, i.e., the
majority of respondents yield relatively shallow-sloped ICCs while the focal group
yields steeply sloped ICCs. Thus, assuming the presence of gender DIF, suboptimal
measurement is primarily manifest as a low contrast between high and low theta
individuals in the majority group. In practical terms, the items tend to measure the
traits better among the male minority, as the higher discriminations (loadings)
translate into greater variance explained among male respondents.
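The distinction between uniform and non-uniform DIF drawn above can be illustrated numerically. The sketch below is an illustration only: it assumes a dichotomous two-parameter logistic ICC for simplicity (the questionnaire items themselves are polytomous), and the discrimination and difficulty values, the helper names `icc` and `dif_directions`, and the theta grid are hypothetical. It shows that when the groups differ only in slope, the sign of the difference between the curves flips along the trait continuum, whereas a difference in location alone keeps the same sign throughout.

```python
import math


def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def dif_directions(reference, focal, thetas, tol=1e-9):
    """Signs of P_focal - P_reference found across the theta grid.

    reference / focal: (a, b) tuples of discrimination and difficulty.
    Returns {-1}, {1}, or {-1, 1}; both signs indicate crossing ICCs.
    """
    signs = set()
    for theta in thetas:
        diff = icc(theta, *focal) - icc(theta, *reference)
        if abs(diff) > tol:
            signs.add(1 if diff > 0 else -1)
    return signs


thetas = [x / 2 for x in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5

# Non-uniform DIF: same difficulty, shallow reference slope, steep focal slope.
print(dif_directions((0.6, 0.0), (1.8, 0.0), thetas))

# Uniform DIF: same slope, item harder for the focal group at every theta.
print(dif_directions((1.0, 0.0), (1.0, 0.5), thetas))
```

The first call returns both signs, because the steeper focal-group curve lies below the reference curve at low theta and above it at high theta. The second call returns a single negative sign, the signature of uniform DIF.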
A common DIF remedy in testing research is revision of the affected items or
their elimination from the final versions of the instruments. Unfortunately, invocation of
these remedies is premature, as there are no indications within the wording of the
questionnaire items of where gender influences endorsement, and loss of items would
lead to loss of information. Instead, it is proposed to retain the complete sets of
items, continue collecting observations in the main study sample, and check gender
invariance again.
A further consideration with respect to the latent trait questionnaire instruments
is the number of categories in the item response scales. It was earlier described how
the ISCEL pilot used a reduced, four-point scale to economize parameters in the final
SR model. Concurrent with the ISCEL pilot, four-point scales were trialed with the
BELLA and ELLSI items as well. These results are detailed in Appendix 7 for
interested readers, as the findings are tangential to
the main research described herein. Suffice it to say, the four-point scale is untenable,
as it reduces the resolution of trait-space considerably, leading to configural variance
in the BELLA scale and loss of dimensionality in the ELLSI scale. Thus, it was
decided that the final questionnaire instrument utilize five-point scales of frequency
and agreement for the three self-assessed latent trait inventories.
Listening Test. Piloting of the item pools for the stimulus-response and short
conversation items found that numerous items exhibited non-uniform DIF effects,
with minimal uniform DIF distortion. However, unlike the questionnaire
instruments, there are more than sufficient numbers of items remaining in the item
pools to construct a compact examination with reduced DIF effects. Moreover,
review of the suspect items can readily identify characteristics of poor test questions,
rendering item deletion as the optimal means to mitigate DIF effects. With these
considerations in mind, the listening exam was trimmed to 40 items based on the
presence of DIF effects and content review, and the revised exam underwent further
rounds of expert review to establish construct validity, i.e., specification and
identification of the listening skills measured by the exam, as previously described.
Another minor change, this time to the test layout, was made in response to
one construct reviewer’s perception of a dominant “travel/vehicle” theme to the test
questions. Several items pertaining to the context of travel, or vehicles in general,
were placed proximally, leading to a subjective impression of bias. The reader may
note short conversation items 23, 29, 34, 36, 37, 39 and 40 in Appendix 4. The
layout of the test during principal data collection repositioned these items to
intersperse them more evenly. Discussion of the listening test analysis in Chapters
Four and Five reflects this altered item order.
Additionally, a notable precaution resulting from the utilization of Mplus
modeling software to run 1PL item analysis concerns the dearth of item fit indices. The
item χ² tests of residual error for dichotomous items yielded unusually optimistic
results in the piloting analyses. These over-estimations of item fit are due to the lack
of trait-ability groupings in the χ² contingency table. Univariate χ² tests of item
residuals in Mplus merely classify respondents into a 2 x 2 table with 1 degree of
freedom, i.e., the overall expected proportion of success versus observed successes
per item. In contrast, the mean square residual outlier-sensitive fit (outfit: Wright &
Masters, 1990) derived from the Q1 χ² statistic (Yen, 1981) utilizes multiple
categories of ability groupings along the trait continuum, thus providing greater
precision in identification of un-modeled noise. When analyzing the test responses
in the main study, the researcher made use of the mean square residual outfit and
z-standardized outfit indices (a.k.a. t-statistics) in lieu of the univariate standardized
residuals and item χ² supplied by Mplus.
For the reader’s reference, the Q1 index of item fit is the sum of the squared
standardized residuals, expressed as:
\[ Q_{1i} = \sum_{j=1}^{m} z_{ij}^{2} \]
where m is the number of trait-ability groups, and $z_{ij}^{2}$ is the squared standardized
residual for item i at ability category j. The Q1 statistic follows a χ² distribution with
degrees of freedom equal to the number of ability groups in the analysis less the
parameterization of the model. In the case of the 1PL model, the degrees of freedom
are m-1. This Q1 item statistic divided by the number of observations, i.e., ability
groupings, yields the mean square residual outfit. The χ² distribution can then be
transformed by means of the Wilson-Hilferty transformation into a unit-normal
distribution with infinite degrees of freedom. This is commonly known as the