The results of parallel analysis are summarized with a modified scree plot
overlaying the EFA eigenvalues of real observations against the average and 95th
percentile of 50 parallel simulations. As shown in Figure 11, the first two factors of
the proposed ELLSI instrument yield significant eigenvalues, corroborating the
researcher’s interpretation of the factor loadings in Table 7.
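The comparison that underlies such a scree plot can be sketched in a few lines. The following is a minimal illustration only, assuming normally distributed simulated samples and Pearson correlations; the function names (`sorted_eigenvalues`, `parallel_analysis`, `n_factors_to_retain`) and the synthetic two-factor data are hypothetical and do not reproduce the software or the responses analyzed in this study.

```python
import numpy as np

def sorted_eigenvalues(data):
    """Eigenvalues of the Pearson correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def parallel_analysis(data, n_sims=50, percentile=95, seed=0):
    """Return observed eigenvalues and the mean and percentile of simulated ones."""
    rng = np.random.default_rng(seed)
    n_cases, n_items = data.shape
    observed = sorted_eigenvalues(data)
    simulated = np.array([
        sorted_eigenvalues(rng.standard_normal((n_cases, n_items)))
        for _ in range(n_sims)
    ])
    return observed, simulated.mean(axis=0), np.percentile(simulated, percentile, axis=0)

def n_factors_to_retain(observed, reference):
    """Count leading factors whose observed eigenvalue exceeds the reference."""
    count = 0
    for obs, ref in zip(observed, reference):
        if obs <= ref:
            break
        count += 1
    return count

# Synthetic illustration (NOT the ELLSI data): 20 items, two orthogonal factors.
rng = np.random.default_rng(1)
factors = rng.standard_normal((500, 2))
loadings = np.zeros((2, 20))
loadings[0, :10] = 0.8   # items 1-10 load on the first factor
loadings[1, 10:] = 0.8   # items 11-20 load on the second factor
responses = factors @ loadings + 0.6 * rng.standard_normal((500, 20))

observed, sim_mean, sim_p95 = parallel_analysis(responses)
retained = n_factors_to_retain(observed, sim_p95)
print("factors retained against the 95th percentile:", retained)
```

Comparing against the 95th percentile rather than the mean is the more conservative retention rule, since an observed eigenvalue must exceed nearly all of the random reference values to be counted.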
Confirmatory MIRT compared the unidimensional model and the
two-dimensional measurement model stipulated by parallel analysis and identified the
latter model as the more accurate reflection of the observed pattern of responses.
The fit indices from this MIRT analysis are shown in Table 8. Table 9 displays the
ELLSI indicators and standardized loadings in the currently identified structure, the
two dimensions of which exhibit a high intercorrelation of 0.78 (S.E. = 0.04, p < 0.01).
Figure 11. ELLSI parallel analysis scree plot with university subjects (n = 315) as
observed sample.
Table 8.
Fit Comparisons of Hypothesized ELLSI Models
Model   df (Parameter)   -2LL       AIC        BIC        Adj. BIC
1D      161 (115)        16880.16   17110.15   17541.7    17176.95
2D      160 (116)        16739.94   16971.95   17407.25   17039.33
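The information criteria in Table 8 follow directly from the -2LL, the number of free parameters, and the sample size. The sketch below recomputes them under the standard definitions, with the sample-size adjusted BIC taken to use n* = (n + 2) / 24; the function name `information_criteria` is illustrative, and small discrepancies from the tabled values reflect rounding of the reported -2LL.

```python
import math

def information_criteria(neg2ll, n_params, n_cases):
    """Return (AIC, BIC, sample-size adjusted BIC) from -2LL."""
    aic = neg2ll + 2 * n_params
    bic = neg2ll + n_params * math.log(n_cases)
    adj_bic = neg2ll + n_params * math.log((n_cases + 2) / 24)
    return aic, bic, adj_bic

# -2LL and free parameter counts as reported in Table 8 (n = 315).
models = {"1D": (16880.16, 115), "2D": (16739.94, 116)}
results = {}
for label, (neg2ll, n_params) in models.items():
    results[label] = information_criteria(neg2ll, n_params, 315)
    aic, bic, adj_bic = results[label]
    print(f"{label}: AIC={aic:.2f}  BIC={bic:.2f}  Adj. BIC={adj_bic:.2f}")
```

Lower values indicate the preferred model on each index, so the recomputed figures can be read in the same way as the table.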
Table 9.
ELLSI MIRT Two-dimensional Structure for University Subjects n = 315.
Item   Descriptor   Std a   R²
Dimension 1
1    I pay attention to the main points of the conversation in English to get a general understanding of what is said.   0.67   0.45
2    I listen to the other person’s speech to determine if he/she has understood me correctly.   0.64   0.41
3    I guess the meaning of unknown words or expressions by noticing redundant words or phrases with similar meaning.   0.58   0.34
4    I guess the meaning of unknown words by noticing the speaker’s tone of voice.   0.61   0.37
5    I guess the meaning of unknown words by noticing the gestures, actions, or facial expressions of the speaker.   0.60   0.36
6    I guess the speaker’s attitude toward the topic of discussion by noticing redundant words or phrases with similar meaning.   0.75   0.56
7    I guess the speaker’s intentions by noticing the gestures, actions, or facial expressions of the speaker.   0.66   0.44
9    I pay attention when the speaker communicates new or important information by noticing the intonation or stress on words.   0.71   0.51
10   I use personal experience to understand the speaker’s meaning and intentions.   0.74   0.55
11   I use my knowledge of the world to understand the speaker’s meaning and intentions.   0.76   0.57
12   I use my knowledge learned from school to understand the speaker’s meaning and intentions.   0.62   0.41
13   I pay attention to English words or expressions that are similar to Chinese words or expressions.   0.56   0.31
15   When listening to CD texts, I use the title to guess the content or main idea of what I will hear.   0.52   0.27
22   I mentally prepare to listen by reviewing what I know and don’t know about the topic.   0.49   0.24
23   I use knowledge of English stress and intonation to help me figure out words spoken unclearly.   0.55   0.30
Dimension 2
8    I judge how well I was able to understand the other person’s speech.   0.51   0.26
14   I figure out the relationship between events when listening to a passage.   0.69   0.48
16   When studying outside of class, I pay attention to my feelings about the listening passage.   0.69   0.47
17   When studying outside of class, I make sure to choose listening passages/materials that I like.   0.60   0.36
18   When listening to a difficult passage outside of class, I group words and expressions together based on common features.   0.58   0.34
19   When listening to a passage in class, I pay attention to my feelings about the passage.   0.71   0.51
20   When conversing in English, I identify chunks of words, or phrases, rather than single words that the other person says.   0.73   0.54
21   When listening to a difficult passage, I identify chunks of words, or phrases, rather than single words.   0.67   0.45
The high intercorrelation of the strategic dimensions requires further analysis.
The intercorrelation is higher than the AVE, suggesting that the two dimensions lack
discriminant validity (Blunch, 2008; Kline, 2005). However, nested model
comparisons shown in Table 8 effectively rule out the parsimonious hypothesis of a
single continuum of trait listening strategy use. Nevertheless, an alternative,
second-order factor structure must also be tested to ascertain the presence/absence of
a higher order general strategy use dimension, g. The presence of g could account
for the oblique, between-factor correlation, whereas the absence of g would suggest
that obliqueness is a reflection of the contingency relation between the two constructs,
and should be construed as a characteristic of these latent variables. The hypothesized
models are depicted in Figures 12 and 13, and the results of the
nested model comparisons are shown in Table 10.
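The discriminant validity concern raised above can be illustrated with the standardized loadings reported in Table 9. The sketch below computes the average variance extracted (AVE) as the mean squared standardized loading of each dimension and compares it with the between-factor correlation of 0.78 and, as in the common Fornell-Larcker formulation, with its square. The assignment of items to dimensions is an assumption read from the layout of Table 9 (items 1-7, 9-13, 15, 22, 23 versus items 8, 14, 16-21), and the variable names are illustrative.

```python
def average_variance_extracted(loadings):
    """Mean of the squared standardized loadings."""
    return sum(value ** 2 for value in loadings) / len(loadings)

# Standardized loadings from Table 9; the grouping is an assumption (see lead-in).
dimension_1 = [0.67, 0.64, 0.58, 0.61, 0.60, 0.75, 0.66, 0.71,
               0.74, 0.76, 0.62, 0.56, 0.52, 0.49, 0.55]
dimension_2 = [0.51, 0.69, 0.69, 0.60, 0.58, 0.71, 0.73, 0.67]
factor_correlation = 0.78

ave_1 = average_variance_extracted(dimension_1)
ave_2 = average_variance_extracted(dimension_2)

for label, ave in (("Dimension 1", ave_1), ("Dimension 2", ave_2)):
    below_r = ave < factor_correlation
    below_r2 = ave < factor_correlation ** 2
    print(f"{label}: AVE={ave:.3f}  below r: {below_r}  below r squared: {below_r2}")
```

Under either comparison, an AVE that falls below the reference value signals that the dimensions share more variance with each other than each shares with its own indicators.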
Figure 12. ELLSI Hypothetical second-order factor structure.
Figure 13. ELLSI Baseline oblique, first-order factor structure.
Table 10.
Fit Comparisons of Second-order and First-order Factor Structures.
Model       df (Parameter)   -2LL       AIC        BIC        Adj. BIC
2nd order   159 (117)        16740.74   16974.74   17413.79   17042.70
1st order   160 (116)        16739.94   16971.95   17407.25   17039.33
The first-order baseline structure yields the best fit across all the indices,
although the difference is marginal and the -2LL difference test is non-significant with
1 df. Nonetheless, every fit index concurs that the more parsimonious first-order
model, without an underlying general strategy factor, g, is the one with less residual
discrepancy.
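The -2LL difference tests reported in Tables 8 and 10 can be checked by hand. The sketch below is illustrative only: it assumes a one-degree-of-freedom chi-square reference distribution, whose upper-tail probability can be written with the complementary error function, and the function name `chi2_sf_1df` is hypothetical. The absolute difference is used for Table 10 because the second-order model, despite its additional parameter, reports the slightly larger -2LL.

```python
import math

def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))

# -2LL values as reported in Tables 8 and 10.
neg2ll_1d = 16880.16
neg2ll_first_order = 16739.94   # also the 2D model of Table 8
neg2ll_second_order = 16740.74

diff_dimensionality = neg2ll_1d - neg2ll_first_order
diff_order = abs(neg2ll_second_order - neg2ll_first_order)

p_dimensionality = chi2_sf_1df(diff_dimensionality)
p_order = chi2_sf_1df(diff_order)

print(f"1D vs 2D: difference={diff_dimensionality:.2f}, p={p_dimensionality:.3g}")
print(f"2nd vs 1st order: difference={diff_order:.2f}, p={p_order:.3f}")
```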
Accordingly, review of the ELLSI indicators in the first-order structure as
currently identified suggests two extant dimensions to the variable of listening strategic
knowledge: Offline and Online strategies. The Offline strategic dimension is
indicated by items which fall exclusively under the canonical categories of cognitive
and metacognitive strategies across one-way and two-way, interactive listening.
This type of listening strategy is deliberative and likely occurs asynchronously with the
stream of aural input, hence the label “offline” in reference to cognitive processing
theories. The Online strategies also span one-way and two-way listening, but
include strategies normally identified as affective in addition to others which are more
readily recognized as cognitive or metacognitive. This type of strategy is employed
during processing synchronous with the stream of aural input, and must be utilized under
temporal constraints. Hence, under the two-dimensional model, they are measured
along a continuum distinct from the Offline strategies.
The salient theme within the Online strategy indicators is the inclusion of
affect-valenced indicators describing feelings, emotions, or “gut-feelings”. The
reader should note item eight, describing a metacognitive strategy of monitoring in
which the item is worded generally by querying whether respondents self-assess their
listening skills. Because this item is situated in live interaction with strict temporal
constraints on processing, it may be conjectured that L2 listeners require
affect-valenced intuition to render synchronous judgment. Also, chunking strategies
in items 20 and 21 are generally considered cognitive strategies. Here again, the
items may serve as a locus for the listener’s affective responses to influence the
implementation of the strategy. Since the strategy is in the setting of live interaction
(item 20) or listening to a difficult passage that may be other-controlled in a
classroom setting (item 21), it is again possible to conjecture that thought units are
being constructed out of the aural input based on the listeners’ intuitions due to strict
temporal constraints on aural processing. In this fashion, the two dimensions appear to
represent different types of thinking: slow, deliberative “offline” thinking versus
instantaneous, real-time “online” thinking. Further analysis in the main study was
required to confirm or disconfirm this hypothetical dichotomy in the listening
strategic knowledge dimensions.
Akin to the validation of the BELLA scale, the ELLSI items next underwent DIF
testing for uniform and non-uniform variance across gender groupings. Mplus 5.1
was used to perform DIF analysis on the ELLSI items using syntax for 1PL mixture
modeling with known classes (1 = male, 0 = female). It was found that eight items
exhibited non-uniform DIF effects and only one item showed uniform DIF, albeit with
negligible effect size. Results of non-uniform and uniform DIF are provided in
Tables 11 and 12, respectively.
Table 11.
ELLSI Non-uniform DIF Test Results (females are reference group)
Baseline Models
23 0.470 18142.798 0.050
Note. Effect sizes are reported only for items with significant DIF. Critical value for χ² (df = 1, p = 0.05) is 3.841.
Table 12.
ELLSI Uniform DIF Test Results (females are reference group)
Baseline Models
14 18779.964 4.606
Online 16 18780.236 4.334 -
17 18768.364 16.206 -0.063
18 18781.430 3.14 -
19 18783.472 1.098 -
21 18777.868 6.702 -
22 18781.178 3.392 -
Note. Effect sizes are reported only for items with significant DIF. Critical value for χ² (df = 4, p = 0.05) is 9.488.
Similar to the BELLA DIF analyses, the estimated dimensional loadings of
ELLSI items tend to be higher for males, the minority group, meaning that numerous
items have reduced power to discriminate between females with high and low levels of
theta, as witnessed by the lower slope gradient. Likewise, the group-item interactions in
the ELLSI scale have negligible impact on respondents’ tendencies to endorse items
(the uniform DIF), which is evident to a small degree (-0.063) in item 17 only. These
results should be taken as tentative, considering the wide sampling disparity between
the gender groupings and the sub-optimal ratio of cases to estimated parameters in the
DIF models.
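The flagging rule stated in the note to Table 12 amounts to comparing each item's χ² statistic against the critical value. A minimal sketch follows, using only the rows of Table 12 that are reproduced above; the variable names are illustrative.

```python
# Critical chi-square value for df = 4 at p = 0.05, as given in the table note.
CRITICAL_VALUE_DF4 = 9.488

# Item number -> chi-square statistic for the uniform DIF test (rows of Table 12).
uniform_dif_chi2 = {
    14: 4.606,
    16: 4.334,
    17: 16.206,
    18: 3.14,
    19: 1.098,
    21: 6.702,
    22: 3.392,
}

# An item is flagged when its statistic exceeds the critical value.
flagged_items = sorted(
    item for item, chi2 in uniform_dif_chi2.items() if chi2 > CRITICAL_VALUE_DF4
)
print("items flagged for uniform DIF:", flagged_items)
```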
Validating Measurement Model of Strategic Competence
This section describes the piloting procedures and data analysis specific to the
application of the ISCEL questionnaire to university undergraduates. The
procedure for validating ISCEL was similar to the validation of the ELLSI and
BELLA scales, wherein EMIRT, confirmatory MIRT, and logistic regression DIF
testing were utilized to identify and confirm invariance of the ISCEL structure. To
pilot the ISCEL inventory, responses from a sample of 475 (222
male, 246 female, 7 unspecified) university undergraduates of mixed majors
were collected from public and private universities in northern and eastern Taiwan at
the start of the fall semester in 2014. The entire sample of 475 participants in the
ISCEL pilot was independent of previous samples for cross-validation and piloting of
the BELLA and ELLSI scales. An additional notable feature of this round of
piloting is that the ISCEL instrument was configured with a four-point response scale
(0-3). The scale was reduced in an attempt to economize parameterization in the
anticipated SR model, which, if the 40 listening exam items were retained as
individual observable variables, would yield 104 indicators across six latent
dimensions. The sum of thresholds and loadings for antecedent and criterion
variables in the final CML could render MLR estimation untenable for standard
notebook and desktop PCs.
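The effect of the response scale on parameterization can be made concrete. In a graded (ordered-categorical) model, each item contributes one loading plus one threshold per category boundary, so shortening the scale removes one threshold per item. The sketch below is illustrative: the function name is hypothetical, and the item counts (23 per scale) and the five-point ELLSI scale are assumptions inferred from the item numbering and the parameter counts reported in Tables 8 and 13.

```python
def graded_parameter_count(n_items, n_categories, n_factor_correlations=0):
    """Free parameters: one loading and (categories - 1) thresholds per item."""
    thresholds = n_items * (n_categories - 1)
    loadings = n_items
    return thresholds + loadings + n_factor_correlations

# Assumed: 23 ISCEL items on a four-point scale, unidimensional model.
iscel_1d = graded_parameter_count(23, 4)
# Assumed: 23 ELLSI items on a five-point scale, one and two dimensions.
ellsi_1d = graded_parameter_count(23, 5)
ellsi_2d = graded_parameter_count(23, 5, n_factor_correlations=1)

print("ISCEL 1D:", iscel_1d)
print("ELLSI 1D:", ellsi_1d)
print("ELLSI 2D:", ellsi_2d)
```

Under these assumptions the counts agree with the parameter totals reported for the corresponding models, and the four-point scale saves one threshold per item relative to a five-point scale.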
Results of ISCEL Pilot
Investigation of the dimensional structure of the ISCEL observed responses
began with parallel analysis to expediently narrow the focus of further analyses.
Parallel EFA analysis, wherein eigenvalues derived from the observed data are
compared against those derived from average and 95th percentile of 50 normally
distributed simulated samples, indicated a unidimensional structure to ISCEL
responses. These results are summarized in Figure 14 showing the overlay of scree
plots. Subsequent rounds of EMIRT analyses compared unidimensional and
two-dimensional variants to confirm or disconfirm the parallel EFA results. The results
of these analyses confirm the unidimensionality of the ISCEL scale, as the
unidimensional model converged smoothly, yielding a complete set of significant
loadings and threshold estimates, whereas the two-dimensional model failed to
converge without numerous parameter constraints, which resulted in a complete set of
non-significant loadings and threshold estimates. Tables 13 and 14 summarize the
results of comparative EMIRT analyses of the ISCEL items.
As shown in Figure 14, the eigenvalue for the second factor in the observed data
correlation matrix falls below both the average and 95th percentile of the 50 simulated
samples. The observed data yielded a second eigenvalue of 1.260, whereas the
average and 95th percentile eigenvalues for the second factor were 1.356 and 1.332,
respectively. This means that the eigenvalues for the second and subsequent vectors
in the observed data matrix, although greater than one, are still neither significantly
greater nor substantively more informative than eigenvalues derived from random
observations, which is indicative of a simple unidimensional structure. The follow-up
confirmatory EMIRT analysis utilizing covariance matrices corroborates the
implausibility of multiple dimensions to the strategic competence trait, as the
estimates of the two-dimensional model are non-significant and statistically untenable.
Figure 14. ISCEL parallel analysis scree plot with university subjects (n = 475) as
observed sample.
Table 13.
Fit Comparisons of Hypothesized ISCEL Models
Model   df (Parameter)   -2LL        AIC         BIC         Adj. BIC
1D      184 (92)         22537.866   22721.866   23104.891   22812.896
2D      160 (116)        22537.614   22769.614   23252.559   22884.392
Note. The 2D model required parameter constraints for convergence.
As seen in Table 13, the predictive fit indices (AIC, BIC, and Adjusted BIC) favor
the unidimensional structure, while the difference in loglikelihood between the two
models is non-significant. These model fit indices demonstrate the superiority of the
unidimensional model over the two-dimensional model in describing the pattern of
subject responses to the scale of strategic competence. Furthermore, at the item level,
the standard errors and concomitant significance of the two-dimensional estimates
shown in Table 14 highlight the poor fit to the observed data. EMIRT analysis will
force the structure stipulated by the researcher and produce factor loadings; hence,
every item loads on its respective factor with generally acceptable values. However,
these values were derived by constraining the standard errors of the unstandardized
estimates to zero, so no significance tests are available. Thus, the parameter estimates
of a hypothetical two-dimensional model in Table 14 are supplied by the statistical
software with the caveat of “no confidence”. For these reasons, the ISCEL scale is
taken to measure strategic competence along a single continuum and will be treated
as a unidimensional construct in the final CML SR model.
Table 14.
ISCEL EMIRT Estimates for One and Two Dimensions
Unidimensional Two-Dimensional
In accordance with the validation procedures of the BELLA and ELLSI scales,
the ISCEL items next underwent DIF testing for uniform and non-uniform variance
across gender groupings. Mplus 7 was used to perform DIF analysis on the ISCEL
items using syntax for 1PL mixture modeling with known classes (1 = male, 0 =
female). The seven cases with unspecified gender were excluded from analysis. It
was found that seven items exhibited non-uniform DIF effects and three items showed
uniform DIF, albeit with negligible effect size. Results of non-uniform and uniform
DIF are provided in Tables 15 and 16, respectively.
Table 15.
ISCEL Non-uniform DIF Test Results (females are reference group)
Baseline Model
20 0.595 23076.760 8.106 -0.121
21 0.605 23075.805 9.016 -0.132
22 0.545 23082.440 2.426 -
23 0.490 23084.834 0.032 -
Note. Effect sizes are reported only for items with significant DIF. Critical value for χ² (df = 1, p = 0.05) is 3.841.
Table 16.
ISCEL Uniform DIF Test Results (females are reference group)
Baseline Model
22 23082.796 2.070 -
23 23082.940 1.926 -
Note. Effect sizes are reported only for items with significant DIF. Critical value for χ² (df = 3, p = 0.05) is 7.815.
The ISCEL scale of strategic competence exhibits moderate non-uniform gender
DIF and minimal uniform DIF effects on male responses to the items, similar to the
other self-assessed latent traits in the CML. Thus, certain items have greater power
to discriminate between high- and low-theta individuals when the respondents are male.
Regardless of the disparity in discrimination power, the disparity in endorsement
tendencies between the genders is minimal. Nevertheless, these results indicate the
need to include gender as a covariate of the strategic competence variable in
the final CML SR model.
Validating Measurement of Listening Abilities
In this section, descriptions of piloting procedures and data analysis specific to
the test of listening comprehension are provided. As mentioned previously, the test
of listening abilities consists of two parts modeled after Taiwan’s GEPT exam:
stimulus-response items and short conversation items. Each part consists of items
that underwent content review and were piloted separately using counterbalanced
forms to mitigate test and fatigue effects. Item calibration and logistic regression
tests of non-uniform and uniform DIF were conducted with Mplus 5.1. Item
analysis provided preliminary indication of candidate items to include in the finalized
test instrument. Items that were retained subsequently underwent a secondary
construct validity review to specify the set of listening skills contained in the
underlying factor structure of the test.
Results of Pilot Test
The stimulus-response items were piloted in the fall semester of 2013 at a single
university in eastern Taiwan with the identical sample of 315 (175 female, 140 male)
undergraduate students that piloted the BELLA and ELLSI questionnaires. The test
was administered immediately following completion of the questionnaires. The
short conversation questions were piloted with a separate sample of 304
undergraduate students (176 female, 128 male) in the spring semester of 2014 at the
same university. Results of calibration analysis for stimulus-response items and
short conversation items are provided in Table 17.
Table 17.
1PL Estimates and Item Residual χ² for Pool of Listening Test Items.
Stimulus-Response                        Short Conversation
Item   Difficulty   SE      χ²           Item   Difficulty   SE      χ²
1      -1.967       0.168   0.014        1      -1.195       0.153   0.018
2      -1.710       0.160   0.018        2      -0.772       0.145   0.018
3      -2.017       0.170   0.013        3      1.015        0.149   0.006
4      -0.487       0.136   0.018        4      0.399        0.142   0.000
Bolded estimates are significant at p < 0.05.
The procedure adopted for paring down both the stimulus-response and short
conversation item pools for use in the finalized testing instrument was to run IRT 1PL
analysis and consult measures of outlier-sensitive fit (outfit). The outfit indices were
reviewed to identify candidate items for inclusion/exclusion in the final version of the
listening test. The Mplus program provides the standardized residuals and item χ²
as measures of outfit, and these indices did not identify outlier-sensitive misfit.
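For readers unfamiliar with the 1PL model, the following sketch shows how an item difficulty translates into a probability of a correct response, and how a conventional outfit mean-square is formed from squared standardized residuals. It assumes the Rasch form with unit discrimination; the outfit statistic shown is the textbook mean-square and is not claimed to match the residual output of Mplus. The response vector and ability values are hypothetical.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the 1PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def outfit_mean_square(responses, thetas, difficulty):
    """Mean of squared standardized residuals for one item."""
    total = 0.0
    for response, theta in zip(responses, thetas):
        expected = rasch_probability(theta, difficulty)
        total += (response - expected) ** 2 / (expected * (1.0 - expected))
    return total / len(responses)

# Difficulties of stimulus-response items 1-4 as reported in Table 17.
difficulties = {1: -1.967, 2: -1.710, 3: -2.017, 4: -0.487}
probabilities = {item: rasch_probability(0.0, b) for item, b in difficulties.items()}
for item, probability in probabilities.items():
    print(f"item {item}: P(correct | theta = 0) = {probability:.3f}")

# Hypothetical abilities and scored responses for a single item.
thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]
responses = [0, 1, 1, 1, 1]
outfit = outfit_mean_square(responses, thetas, difficulties[4])
print(f"outfit mean-square for the hypothetical data: {outfit:.3f}")
```

Because each residual is divided by its model variance, a single unexpected response from a respondent far from the item's difficulty inflates the statistic sharply, which is why outfit is described as outlier-sensitive.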
Subsequent item analysis used logistic regression gender DIF detection to
ascertain metric and scale invariance, assuming significant gender DIF could also
mark items for exclusion in the final test form.
The DIF detection followed the logistic regression method performed with
Mplus, as described previously with the listening belief and strategic knowledge
inventories. A slight departure from the questionnaire analysis is that use of the
dichotomous test score models establishes the Δdf for the respective DIF models as
one (two df when testing for total DIF effects). The results of non-uniform and
uniform DIF testing for the respective item pools are shown in Tables 18 and 19.
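The nested-model logic of logistic regression DIF detection for a dichotomous item can be sketched as follows. This is an illustration on simulated data, not the Mplus procedure used in this study: the Newton-Raphson fitting routine, the simulated ability variable used as the matching criterion, and all names are assumptions. Each step adds one parameter (a group term for uniform DIF, then a group-by-ability term for non-uniform DIF), so each step is tested with one degree of freedom and the total effect with two.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, -2LL)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    neg2ll = -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, neg2ll

# Simulated data with uniform DIF deliberately built in (group shifts the logit).
rng = np.random.default_rng(0)
n = 1000
ability = rng.standard_normal(n)
group = rng.integers(0, 2, n).astype(float)   # 1 = focal, 0 = reference
logit = -0.2 + 1.0 * ability + 1.0 * group
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

ones = np.ones(n)
_, neg2ll_base = fit_logistic(np.column_stack([ones, ability]), y)
_, neg2ll_uniform = fit_logistic(np.column_stack([ones, ability, group]), y)
_, neg2ll_nonuniform = fit_logistic(
    np.column_stack([ones, ability, group, ability * group]), y
)

chi2_uniform = neg2ll_base - neg2ll_uniform            # df = 1
chi2_nonuniform = neg2ll_uniform - neg2ll_nonuniform   # df = 1
chi2_total = neg2ll_base - neg2ll_nonuniform           # df = 2

print(f"uniform DIF:     chi2={chi2_uniform:.3f} (critical 3.841)")
print(f"non-uniform DIF: chi2={chi2_nonuniform:.3f} (critical 3.841)")
print(f"total DIF:       chi2={chi2_total:.3f} (critical 5.991)")
```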
Similar to the DIF results for the questionnaire instruments, the pilot tests
contain numerous non-uniform DIF-affected items, while uniform DIF effects are few
and minimal. The content of the DIF-affected item stems and option arrays was
re-examined by the present researcher to look for common features which could
explain the gender DIF. This content review found that many of the flagged items,
shown in Tables 20 and 21, were situated in the context of non-academic social
relationships.
Table 18.
Non-uniform and Uniform DIF Results for Stimulus-response Items.
Baseline Model
Std. a   -2LL       df_baseline - df_focal
0.483    13432.84   1
Non-uniform DIF models Uniform DIF model
Focal
24 0.571 13431.88 0.970 - 13432.22 0.630
Note: Effect sizes are shown only for significantly DIF-affected items.
Table 19.
Non-uniform and Uniform DIF Results for Short Conversation Items.
Baseline Model
Std. a   -2LL        df_baseline - df_focal
0.483    10386.764   1
Non-uniform DIF models Uniform DIF model
Focal
16 0.304 10383.72 3.052 - 10385.88 0.888 -
Note: Effect sizes are shown only for significantly DIF-affected items.
Table 20.
Content of stimulus-response items flagged for gender DIF effects.
Stimulus-Response Stems and Options
DIF ∆R2
So, after all these years, we're still in love.
Oh, that's so sweet.
The teacher is in the office.
I know why she's a teacher.
-.28 .07