Note. Primary loadings are in bold - 聽力聯合模式：說明信念與策略對聽力理解能力影響之實證模式

The results of parallel analysis are summarized with a modified scree plot

overlaying the EFA eigenvalues of real observations against the average and 95th

percentile of 50 parallel simulations. As shown in Figure 11, the first two factors of

the proposed ELLSI instrument yield significant eigenvalues, corroborating the

researcher’s interpretation of the factor loadings in Table 7.

Confirmatory MIRT compared the unidimensional model and the

two-dimensional measurement model stipulated by parallel analysis and identified the

latter model as the more accurate reflection of the observed pattern of responses.

The fit indices from this MIRT analysis are shown in Table 8. Table 9 displays the

ELLSI indicators and standardized loadings in the currently identified structure. The

two dimensions exhibit a high intercorrelation of 0.78 (S.E. = 0.04, p < 0.01).

Figure 11. ELLSI Parallel analysis scree plot with university subjects (n = 315) as

observed sample.

Table 8.

Fit Comparisons of Hypothesized ELLSI Models

Model   df (Parameter)   -2LL       AIC        BIC        Adj. BIC
1D      161 (115)        16880.16   17110.15   17541.7    17176.95
2D      160 (116)        16739.94   16971.95   17407.25   17039.33
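The nested comparison of the 1D and 2D models can be checked by hand from the -2LL values in Table 8. The following is a minimal sketch, assuming that the two models differ by one free parameter (115 versus 116) and that the difference in -2LL can be referred to a chi-square distribution with 1 df. Because the unidimensional model sits on the boundary of the two-dimensional one, that reference distribution is only approximate. The function name `lr_test_1df` is illustrative and not part of the original analysis.

```python
import math


def lr_test_1df(neg2ll_restricted, neg2ll_general):
    """Likelihood-ratio test for nested models that differ by one parameter.

    Returns the chi-square statistic and its p-value. For 1 df the
    chi-square upper-tail probability reduces to erfc(sqrt(x / 2)).
    """
    stat = neg2ll_restricted - neg2ll_general
    if stat < 0:
        raise ValueError("the general model should not fit worse than the restricted one")
    return stat, math.erfc(math.sqrt(stat / 2.0))


# -2LL values copied from Table 8 (1D is the restricted model, 2D the general one).
stat, p_value = lr_test_1df(16880.16, 16739.94)
print(f"delta -2LL = {stat:.2f}, p = {p_value:.3g}")
```

The statistic is far above the 1-df critical value of 3.841, which agrees with the preference for the two-dimensional model.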

Table 9.

ELLSI MIRT Two-dimensional Structure for University Subjects (n = 315).

Item   Std a   R²     Descriptor

First dimension
1      0.67    0.45   I pay attention to the main points of the conversation in English to get a general understanding of what is said.
2      0.64    0.41   I listen to the other person’s speech to determine if he/she has understood me correctly.
3      0.58    0.34   I guess the meaning of unknown words or expressions by noticing redundant words or phrases with similar meaning.
4      0.61    0.37   I guess the meaning of unknown words by noticing the speaker’s tone of voice.
5      0.60    0.36   I guess the meaning of unknown words by noticing the gestures, actions, or facial expressions of the speaker.
6      0.75    0.56   I guess the speaker’s attitude toward the topic of discussion by noticing redundant words or phrases with similar meaning.
7      0.66    0.44   I guess the speaker’s intentions by noticing the gestures, actions, or facial expressions of the speaker.
9      0.71    0.51   I pay attention when the speaker communicates new or important information by noticing the intonation or stress on words.
10     0.74    0.55   I use personal experience to understand the speaker’s meaning and intentions.
11     0.76    0.57   I use my knowledge of the world to understand the speaker’s meaning and intentions.
12     0.62    0.41   I use my knowledge learned from school to understand the speaker’s meaning and intentions.
13     0.56    0.31   I pay attention to English words or expressions that are similar to Chinese words or expressions.
15     0.52    0.27   When listening to CD texts, I use the title to guess the content or main idea of what I will hear.
22     0.49    0.24   I mentally prepare to listen by reviewing what I know and don’t know about the topic.
23     0.55    0.30   I use knowledge of English stress and intonation to help me figure out words spoken unclearly.

Second dimension
8      0.51    0.26   I judge how well I was able to understand the other person’s speech.
14     0.69    0.48   I figure out the relationship between events when listening to a passage.
16     0.69    0.47   When studying outside of class, I pay attention to my feelings about the listening passage.
17     0.60    0.36   When studying outside of class, I make sure to choose listening passages/materials that I like.
18     0.58    0.34   When listening to a difficult passage outside of class, I group words and expressions together based on common features.
19     0.71    0.51   When listening to a passage in class, I pay attention to my feelings about the passage.
20     0.73    0.54   When conversing in English, I identify chunks of words, or phrases, rather than single words that the other person says.
21     0.67    0.45   When listening to a difficult passage, I identify chunks of words, or phrases, rather than single words.

The high intercorrelation of the strategic dimensions requires further analysis.

The intercorrelation is higher than the AVE, suggesting that the two dimensions lack

discriminant validity (Blunch, 2008; Kline, 2005). However, nested model

comparisons shown in Table 8 effectively rule out the parsimonious hypothesis of a

single continuum of trait listening strategy use. Nevertheless, an alternative,

second-order factor structure must also be tested to ascertain the presence/absence of

a higher order general strategy use dimension, g. The presence of g could account

for the oblique, between-factor correlation, whereas the absence of g would suggest

that obliqueness is a reflection of the contingency relation between the two constructs,

and should be construed as a characteristic of these latent variables. The hypothetical

models are depicted in Figures 12 and 13, and the results of the nested model

comparisons are shown in Table 10.
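The comparison between the factor intercorrelation and the AVE can be illustrated with the values reported in Table 9. This is a minimal sketch, assuming that AVE is computed as the mean of the squared standardized loadings (the R² column) and that the items group into two dimensions in the order in which they appear in the table. The variable and function names are illustrative only.

```python
# Squared standardized loadings (R² column) copied from Table 9.
# The split into two dimensions follows the row order of that table and is
# an assumption of this sketch.
r2_first = [0.45, 0.41, 0.34, 0.37, 0.36, 0.56, 0.44, 0.51,
            0.55, 0.57, 0.41, 0.31, 0.27, 0.24, 0.30]   # items 1-7, 9-13, 15, 22, 23
r2_second = [0.26, 0.48, 0.47, 0.36, 0.34, 0.51, 0.54, 0.45]  # items 8, 14, 16-21


def average_variance_extracted(r_squared):
    """Mean of the squared standardized loadings of one dimension."""
    return sum(r_squared) / len(r_squared)


ave_first = average_variance_extracted(r2_first)
ave_second = average_variance_extracted(r2_second)
factor_correlation = 0.78  # intercorrelation reported in the text

# Comparison as stated in the text (correlation against AVE), plus the
# stricter-for-the-factors variant that uses the squared correlation.
exceeds_ave = factor_correlation > max(ave_first, ave_second)
squared_exceeds_ave = factor_correlation ** 2 > max(ave_first, ave_second)

print(f"AVE first = {ave_first:.3f}, AVE second = {ave_second:.3f}")
print(f"r = {factor_correlation}, r^2 = {factor_correlation ** 2:.3f}")
```

Under these assumptions both AVE values fall below the intercorrelation and below its square, so the conclusion about discriminant validity does not depend on which variant of the comparison is used.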

Figure 12. ELLSI Hypothetical second-order factor structure.

Figure 13. ELLSI Baseline oblique, first-order factor structure.

Table 10.

Fit Comparisons of Second-order and First-order Factor Structures.

Model       df (Parameter)   -2LL       AIC        BIC        Adj. BIC
2nd order   159 (117)        16740.74   16974.74   17413.79   17042.70
1st order   160 (116)        16739.94   16971.95   17407.25   17039.33

The first-order baseline structure yields the best fit across all the indices, although the differences are marginal and the -2LL difference test is non-significant with 1 df. Nonetheless, every fit index concurs that the more parsimonious first-order model, without an underlying general strategy factor, g, is the one with less residual discrepancy.
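The information criteria in Table 10 can be reproduced from the -2LL values and parameter counts. The sketch below assumes the usual definitions, AIC = -2LL + 2k and BIC = -2LL + k ln(n), together with the sample-size adjusted BIC that replaces n with (n + 2) / 24, and takes n = 315 from the pilot sample. The function name `information_criteria` is illustrative.

```python
import math


def information_criteria(neg2ll, n_params, n_obs):
    """Return (AIC, BIC, sample-size adjusted BIC) for one fitted model."""
    aic = neg2ll + 2 * n_params
    bic = neg2ll + n_params * math.log(n_obs)
    # Adjusted BIC: same penalty form, with n replaced by (n + 2) / 24.
    adj_bic = neg2ll + n_params * math.log((n_obs + 2) / 24)
    return aic, bic, adj_bic


# -2LL and parameter counts copied from Table 10; n = 315 university subjects.
second_order = information_criteria(16740.74, 117, 315)
first_order = information_criteria(16739.94, 116, 315)

for label, values in (("2nd order", second_order), ("1st order", first_order)):
    print(label, " ".join(f"{v:.2f}" for v in values))
```

The recomputed values agree with the tabled ones to within rounding of the reported -2LL, and each criterion is lower for the first-order model.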

Accordingly, review of the ELLSI indicators in the first-order structure as

currently identified suggests two extant dimensions to the variable of listening strategic

knowledge: Offline and Online strategies. The Offline strategic dimension is

indicated by items which fall exclusively under the canonical categories of cognitive

and metacognitive strategies across one-way and two-way, interactive listening.

This type of listening strategy is deliberative and likely occurs asynchronous to the

stream of aural input, hence the label “offline” in reference to cognitive processing

theories. The Online strategies also span one-way and two-way listening, but

include strategies normally identified as affective in addition to others which are more

readily recognized as cognitive or metacognitive. This type of strategy is deployed

during processing synchronous to the stream of aural input, and must be utilized under

temporal constraints. Hence, under the two-dimensional model, they are measured

along a continuum distinct from the Offline strategies.

The salient theme within the Online strategy indicators is the inclusion of

affect-valenced indicators describing feelings, emotions, or “gut-feelings”. The

reader should note item eight, describing a metacognitive strategy of monitoring in

which the item is worded generally by querying whether respondents self-assess their

listening skills. Because this item is situated in live interaction with strict temporal

constraints on processing, it may be conjectured that L2 listeners require

affect-valenced intuition to render synchronous judgment. Also, chunking strategies

in items 20 and 21 are generally considered cognitive strategies. Here again, the

items may serve as a locus for the listener’s affective responses to influence the

implementation of the strategy. Since the strategy is in the setting of live interaction

(item 20) or listening to a difficult passage that may be other-controlled in a

classroom setting (item 21), it is again possible to conjecture that thought units are

being constructed out of the aural input based on the listeners’ intuitions due to strict

temporal constraints on aural processing. In this fashion, the two dimensions appear to

represent different types of thinking: slow, deliberative “offline” thinking versus

instantaneous, real-time, “online” thinking. Further analysis in the main study was

required to confirm or disconfirm this hypothetical dichotomy in the listening

strategic knowledge dimensions.

Akin to the validation of the BELLA scale, the ELLSI items next underwent DIF

testing for uniform and non-uniform variance across gender groupings. Mplus 5.1

was used to perform DIF analysis on the ELLSI items using syntax for 1PL mixture

modeling with known classes (1 = male, 0 = female). It was found that eight items

exhibited non-uniform DIF effects and only one item showed uniform DIF, albeit with

negligible effect size. Results of non-uniform and uniform DIF are provided in

Tables 11 and 12, respectively.

Table 11.

ELLSI Non-uniform DIF Test Results (females are reference group)

Baseline Models

23 0.470 18142.798 0.050

Note. Effect sizes are reported only for items with significant DIF. Critical value for χ²(df = 1, p = 0.05) is 3.841.

Table 12.

ELLSI Uniform DIF Test Results (females are reference group)

Baseline Models

14 18779.964 4.606

Online 16 18780.236 4.334 -

17 18768.364 16.206 -0.063

18 18781.430 3.14 -

19 18783.472 1.098 -

21 18777.868 6.702 -

22 18781.178 3.392 -

Note. Effect sizes are reported only for items with significant DIF. Critical value for χ²(df = 4, p = 0.05) is 9.488.
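The critical values quoted in the table notes, and the flagging of items against them, can be verified with a short chi-square tail computation. The sketch below is not the Mplus procedure itself. It assumes that the third column of Table 12 holds each item's chi-square difference from the baseline model, and it uses only the rows that survive in the table as printed here. The function name `chi2_sf` is illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)
    with a = df / 2 and z = x / 2, starting from Q(1/2, z) = erfc(sqrt(z))
    for odd df or Q(1, z) = exp(-z) for even df.
    """
    if df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x non-negative")
    z = x / 2.0
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Item number -> chi-square difference, copied from the rows of Table 12
# (assumed to be the third column of that table).
uniform_dif = {14: 4.606, 16: 4.334, 17: 16.206, 18: 3.14,
               19: 1.098, 21: 6.702, 22: 3.392}

flagged = [item for item, chi in uniform_dif.items() if chi2_sf(chi, 4) < 0.05]
print("tail probability at 3.841 (df = 1):", round(chi2_sf(3.841, 1), 4))
print("tail probability at 9.488 (df = 4):", round(chi2_sf(9.488, 4), 4))
print("items exceeding the df = 4 critical value:", flagged)
```

Among the rows shown, only item 17 exceeds the df = 4 critical value, which matches the single effect size reported in the table.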

Similar to the BELLA DIF analyses, the estimated dimensional loadings of ELLSI items tend to be higher for males, the minority group. Numerous items therefore have reduced power to discriminate between females with high and low levels of theta, as witnessed by the lower slope gradient. Likewise, the group-item interactions in the ELLSI scale have negligible impact on respondents’ tendencies to endorse items: uniform DIF is evident to a small degree (-.063) in item 17 only. These results should be taken as tentative, considering the wide sampling disparity between the gender groupings and the sub-optimal ratio of cases to estimated parameters in the DIF models.

Validating Measurement Model of Strategic Competence

Descriptions of piloting procedures and data analysis specific to the application

of the ISCEL questionnaire on university undergraduates are provided. The

procedure for validating ISCEL was similar to the validation of the ELLSI and

BELLA scales wherein EMIRT, confirmatory MIRT, and logistic regression DIF

testing were utilized to identify and confirm invariance of the ISCEL structure. To

accomplish piloting of the ISCEL inventory, responses from a sample of 475 (222

male, 246 female, 7 unspecified) university undergraduates of mixed majors

were collected from public and private universities in northern and eastern Taiwan at

the start of the fall semester in 2014. The entire sample of 475 participants in the

ISCEL pilot was independent of previous samples for cross-validation and piloting of

the BELLA and ELLSI scales. An additional notable feature of this round of

piloting is that the ISCEL instrument was configured with a four-point response scale

(0-3). The scale was reduced in an attempt to economize parameterization in the

anticipated SR model, which, if the 40 listening exam items were retained as

individual observable variables, would yield 104 indicators across six latent

dimensions. The sum of thresholds and loadings for antecedent and criterion

variables in the final CML could render MLR estimation untenable for standard

notebook and desktop PCs.

Results of ISCEL Pilot

Investigation of the dimensional structure of the ISCEL observed responses

began with parallel analysis to expediently narrow the focus of further analyses.

Parallel EFA analysis, wherein eigenvalues derived from the observed data are

compared against those derived from the average and 95th percentile of 50 normally

distributed simulated samples, indicated a unidimensional structure to ISCEL

responses. These results are summarized in Figure 14 showing the overlay of scree

plots. Subsequent rounds of EMIRT analyses compared unidimensional and

two-dimensional variants to confirm or disconfirm the parallel EFA results. The results

of these analyses confirm the unidimensionality of the ISCEL scale, as the

unidimensional model converged smoothly, yielding a complete set of significant

loadings and threshold estimates, whereas the two-dimensional model failed to

converge without numerous parameter constraints, which resulted in a complete set of

non-significant loadings and threshold estimates. Tables 13 and 14 summarize the

results of comparative EMIRT analyses of the ISCEL items.

As shown in Figure 14, the eigenvalue for the second factor in the observed data correlation matrix falls below both the average and 95th percentile of the 50 simulated samples. The observed data yielded a second eigenvalue of 1.260, whereas the average and 95th percentile eigenvalues for the second factor were 1.356 and 1.332, respectively. This means that the eigenvalues for the second and subsequent vectors in the observed data matrix, although greater than one, are neither significantly greater nor substantively more informative than eigenvalues derived from random observations, which indicates a simple unidimensional structure. The follow-up confirmatory EMIRT analysis, utilizing covariance matrices, corroborates the implausibility of multiple dimensions to the strategic competence trait, as the standard errors of the two-dimensional estimates are statistically untenable.

Figure 14. ISCEL Parallel analysis scree plot with university subjects (n = 475) as

observed sample.

Table 13.

Fit Comparisons of Hypothesized ISCEL Models

Model   df (Parameter)   -2LL        AIC         BIC         Adj. BIC
1D      184 (92)         22537.866   22721.866   23104.891   22812.896
2D      160 (116)        22537.614   22769.614   23252.559   22884.392

Note. The 2D model required parameter constraints for convergence.

As seen in Table 13, the predictive fit indices (AIC, BIC, and Adjusted BIC) point to the unidimensional structure as providing the most information, while the difference in log-likelihood fit is non-significant. These model fit indices demonstrate the superiority of the unidimensional model over the two-dimensional model in describing the pattern of subject responses to the scale of strategic competence. Furthermore, at the item level, the standard errors and concomitant significance of the two-dimensional estimates shown in Table 14 highlight the poor fit to the observed data. EMIRT analysis forces the structure stipulated by the researcher and produces factor loadings; hence, every item loads on its respective factor with generally acceptable values. However, these values were derived by constraining the standard errors of the unstandardized estimates to zero, which yields no tests of significance. Thus, the parameter estimates of a hypothetical two-dimensional model in Table 14 are supplied by the statistical software with the caveat of “no confidence”. For these reasons, the ISCEL scale is taken to measure strategic competence along a single continuum and will be treated as a unidimensional construct in the final CML SR model.

Table 14.

ISCEL EMIRT Estimates for One and Two Dimensions

Unidimensional Two-Dimensional

In accordance with the validation procedures of the BELLA and ELLSI scales,

the ISCEL items next underwent DIF testing for uniform and non-uniform variance

across gender groupings. Mplus 7 was used to perform DIF analysis on the ISCEL

items using syntax for 1PL mixture modeling with known classes (1 = male, 0 =

female). The seven cases with unspecified gender were excluded from analysis. It

was found that seven items exhibited non-uniform DIF effects and three items showed

uniform DIF, albeit with negligible effect size. Results of non-uniform and uniform

DIF are provided in Tables 15 and 16, respectively.

Table 15.

ISCEL Non-uniform DIF Test Results (females are reference group)

Baseline Model

20 0.595 23076.760 8.106 -0.121

21 0.605 23075.805 9.016 -0.132

22 0.545 23082.440 2.426 -

23 0.490 23084.834 0.032 -

Note. Effect sizes are reported only for items with significant DIF. Critical value for χ²(df = 1, p = 0.05) is 3.841.

Table 16.

ISCEL Uniform DIF Test Results (females are reference group)

Baseline Model

22 23082.796 2.070 -

23 23082.940 1.926 -

Note. Effect sizes are reported only for items with significant DIF. Critical value for χ²(df = 3, p = 0.05) is 7.815.

The ISCEL scale of strategic competence exhibits moderate non-uniform gender

DIF and minimal uniform DIF effects on male responses to the items, similar to the

other self-assessed latent traits in the CML. Thus, certain items have a greater power

to discriminate high and low theta individuals when the respondents are male.

Regardless of the disparity in discrimination power, the disparity in endorsement

tendencies between the genders is minimal. Nevertheless, these results are

indicative of the necessity to covary gender on the strategic competence variable in

the final CML SR model.

Validating Measurement of Listening Abilities

In this section, descriptions of piloting procedures and data analysis specific to

the test of listening comprehension are provided. As mentioned previously, the test

of listening abilities consists of two parts modeled after Taiwan’s GEPT exam:

stimulus-response items and short conversation items. Each part consists of items

that underwent content review and were piloted separately using counterbalanced

forms to mitigate test and fatigue effects. Item calibration and logistic regression

tests of non-uniform and uniform DIF were conducted with Mplus 5.1. Item

analysis provided preliminary indication of candidate items to include in the finalized

test instrument. Items that were retained subsequently underwent a secondary

construct validity review to specify the set of listening skills contained in the

underlying factor structure of the test.

Results of Pilot Test

The stimulus-response items were piloted in the fall semester of 2013 at a single

university in eastern Taiwan with the identical sample of 315 (175 female, 140 male)

undergraduate students that piloted the BELLA and ELLSI questionnaires. The test

was administered immediately following completion of the questionnaires. The

short conversation questions were piloted with a separate sample of 304

undergraduate students (176 female, 128 male) in the spring semester of 2014 at the

same university. Results of calibration analysis for stimulus-response items and

short conversation items are provided in Table 17.

Table 17.

1PL Estimates and Item Residual χ² for Pool of Listening Test Items.

        Stimulus-Response                  Short Conversation
Item    Difficulty   SE      χ²       Item    Difficulty   SE      χ²
1       -1.967       0.168   0.014    1       -1.195       0.153   0.018
2       -1.710       0.160   0.018    2       -0.772       0.145   0.018
3       -2.017       0.170   0.013    3        1.015       0.149   0.006
4       -0.487       0.136   0.018    4        0.399       0.142   0.000

Bolded estimates are significant at p < 0.05.

The procedure adopted for paring down both the stimulus-response and short

conversation item pools for use in the finalized testing instrument was to run IRT 1PL

analysis and consult measures of outlier-sensitive fit (outfit). The outfit indices were

reviewed to identify candidate items for inclusion/exclusion in the final version of the

listening test. The Mplus program provides the standardized residuals and item χ²

as measures of outfit, and these indices did not identify outlier-sensitive misfit.
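To give the 1PL difficulty estimates a concrete reading, the sketch below converts them into response probabilities. It assumes the standard one-parameter logistic form with unit discrimination and difficulties expressed on the logit scale. The scaling actually used by the software may differ, so the figures are illustrative rather than a reproduction of the calibration. The function name `p_correct` is hypothetical.

```python
import math


def p_correct(theta, difficulty):
    """1PL probability of a correct response for ability theta (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Difficulty estimates for the stimulus-response items listed in Table 17.
stimulus_response = {1: -1.967, 2: -1.710, 3: -2.017, 4: -0.487}

# Probability that a test taker of average ability (theta = 0) answers correctly.
probabilities = {item: p_correct(0.0, b) for item, b in stimulus_response.items()}
for item, p in probabilities.items():
    print(f"item {item}: b = {stimulus_response[item]:+.3f}, P(correct | theta = 0) = {p:.3f}")
```

Under this parameterization a negative difficulty means that a test taker of average ability is more likely than not to answer correctly, so the four stimulus-response items shown are comparatively easy.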

Subsequent item analysis used logistic regression gender DIF detection to

ascertain metric and scale invariance, assuming significant gender DIF could also

mark items for exclusion in the final test form.

The DIF detection followed the logistic regression method performed with

Mplus, as described previously with the listening belief and strategic knowledge

inventories. A slight departure from the questionnaire analysis is that use of the

dichotomous test score models establishes the Δdf for the respective DIF models as

one (two df when testing for total DIF effects). The results of non-uniform and

uniform DIF testing for the respective item pools are shown in Tables 18 and 19.

Similar to the DIF results for the questionnaire instruments, the pilot tests

contain numerous non-uniform DIF-affected items, while uniform DIF effects are few

and minimal. The content of the DIF-affected item stems and option arrays was

re-examined by the present researcher to look for common features which could

explain the gender DIF. This content review found that many of the flagged items,

shown in Tables 20 and 21, were situated in the context of non-academic social

relationships.

Table 18.

Non-uniform and Uniform DIF Results for Stimulus-response Items.

Baseline Model

Std. a   -2LL       df_baseline - df_focal
.483     13432.84   1

Non-uniform DIF models Uniform DIF model

Focal

24 0.571 13431.88 0.970 - 13432.22 0.630

Note: Effect sizes are shown only for significantly DIF-affected items.

Table 19.

Non-uniform and Uniform DIF Results for Short Conversation Items.

Baseline Model

Std. a   -2LL        df_baseline - df_focal
0.483    10386.764   1

Non-uniform DIF models Uniform DIF model

Focal

16 0.304 10383.72 3.052 - 10385.88 0.888 -

Note: Effect sizes are shown only for significantly DIF-affected items.

Table 20.

Content of stimulus-response items flagged for gender DIF effects.

Stimulus-Response Stems and Options

DIF ∆R²

So, after all these years, we're still in love.

Oh, that's so sweet.

The teacher is in the office.

I know why she's a teacher.

-.28 .07
