
CHAPTER FOUR RESULTS AND DISCUSSION

In the first section of this chapter, the results of the statistical analyses are presented. The three phases of data computation comprise analyses of reliability and validity, test equating, and regression analyses. For reliability, Cronbach’s alpha coefficients were computed for the DST and the RCT. For validity, the analyses focused on the RCT, since it was experimentally designed to meet the purposes of this exploratory study. Compared with the DST, a standardized, high-stakes test, the RCT demanded a meticulous examination. Test equating and regression analyses were performed on both tests to shed more light on the interrelationship between these gap-filling tests under the hypothesized construct of cohesion. In the second section, the discussion of the research findings focuses on the establishment of reliability and validity, test inequivalence, and the predictive power of the RCT subtests.

Overall Results

In this section, the general descriptive statistics of the individual test measures are presented. As shown in Tables 4 to 7, the means of the RCT are slightly higher than those of the DST for the whole sample and at every grade level except Grade 11. The differences were small: .55, .87, .33, and 1.08 for Grades 10–12, Grade 10, Grade 11, and Grade 12, respectively.¹ Overall, the RCT was shown to be a slightly easier test format than the DST.

8 The results of statistical significance of differences in means are presented in the section of test


Table 4. Descriptive Statistics for Grades 10–12

Test  n    k   M      SD    α    SEM
RCT   354  30  21.09  4.84  .80  .257
DST   354  30  20.54  5.45  .84  .290

Note: k represents the total number of test items.

Table 5. Descriptive Statistics for Grade 10

Test  n    k   M      SD    α    SEM
RCT   124  30  19.42  4.79  .80  .43
DST   124  30  18.55  5.52  .83  .50

Table 6. Descriptive Statistics for Grade 11

Test  n    k   M      SD    α    SEM
RCT   115  30  20.50  5.02  .81  .47
DST   115  30  20.83  5.40  .84  .50


Table 7. Descriptive Statistics for Grade 12

Test  n    k   M      SD    α    SEM
RCT   115  30  23.48  3.65  .70  .34
DST   115  30  22.40  4.70  .80  .44
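The SEM column in Tables 4 to 7 appears to be consistent with the standard error of the mean, that is, SD divided by the square root of n. The following sketch checks that reading against a few of the reported rows; the function name and the choice of rows are illustrative assumptions, not part of the original analysis.

```python
import math


def standard_error_of_mean(sd: float, n: int) -> float:
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


# (label, n, SD, reported SEM) copied from Tables 4, 5 and 7
reported_rows = [
    ("RCT, Grades 10-12", 354, 4.84, 0.257),
    ("DST, Grades 10-12", 354, 5.45, 0.290),
    ("RCT, Grade 10", 124, 4.79, 0.43),
    ("DST, Grade 12", 115, 4.70, 0.44),
]

for label, n, sd, reported in reported_rows:
    computed = standard_error_of_mean(sd, n)
    print(f"{label}: computed {computed:.3f}, reported {reported}")
```

Under this reading, the SEM values describe the precision of each mean score rather than the measurement error of an individual score.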

Table 8 presents the mean scores of the three RCT subtests. As can be seen in the table, among the three, Cloze B (lexical cohesion) tends to be the most difficult subtest, while Cloze A (reference) appears the easiest. This pattern remained across different grade levels except Grade 12, in which Cloze C (conjunction) was shown to be the easiest subtest.

Table 8. Mean Scores of the RCT Subtests

              Cloze A  Cloze B  Cloze C
Grade 10      7.27     5.61     6.54
Grade 11      7.32     6.11     7.07
Grade 12      7.97     7.43     8.06
Grades 10–12  7.51     6.37     7.21

To test the differences in means between the subtests, a repeated-measures ANOVA and multiple comparisons were performed. The results are shown in Table 9. The comparisons exhibited statistically significant differences in means between each pair of the three subtests. Therefore, based on differences in mean scores, the


Table 9. A Repeated-Measures ANOVA and Multiple Comparisons for the RCT Subtests

Source    Mean Square  df   F         p
Group     124.73       2    62.67***  .000
Residual  1.99         706

Note. ***p < .001

         Cloze A  Cloze B    Cloze C
Cloze A  –        112.55***  9.12**
Cloze B           –          60.42***
Cloze C                      –

Note. ***p < .001, **p < .01

Concerning the overall data, Table 10 presents the Pearson product-moment correlation coefficients between the test components. With the alpha level preset at p < .001, all of the tests were found to correlate with each other significantly. The three RCT subtests were found to correlate at a low-intermediate level (r = .44 to .55), suggesting that the three may tap separate constructs of cohesion. This finding may lend further support to the results shown previously, namely that the three subtests were statistically different tests (see Table 8 and Table 9).

The term grades correlated most highly with the RCT, while their coefficients with the remaining tests were lower than .50. The result should be interpreted with caution. As previously stated, the term grades were not assigned to each participant based on the same measure. Therefore, Tables 11 to 13 may provide a clearer picture of the correlation between the term grades and the other tests within specific grade levels. In these tables, the term grades were still shown to correlate most highly with the RCT. In actuality, the term grades comprised various test components (e.g., monthly exams, listening, writing, grammar, etc.) at the school. Higher correlations between the term grades and the RCT may imply the multi-dimensional, integrative nature of the RCT.

Table 10. An Intercorrelation Matrix for the Tests (Grades 10–12)

                1    2    3    4    5    6
1. Cloze A      –
2. Cloze B      .51  –
3. Cloze C      .44  .55  –
4. RCT          .77  .87  .81  –
5. DST          .51  .59  .51  .66  –
6. Term Grades  .49  .44  .39  .53  .45  –

Note. All correlations are significant at ***p < .001.

Table 11. An Intercorrelation Matrix for the Tests (Grade 10)

                1    2    3    4    5    6
1. Cloze A      –
2. Cloze B      .47  –
3. Cloze C      .31  .48  –
4. RCT          .72  .86  .76  –
5. DST          .50  .65  .52  .72  –
6. Term Grades  .49  .66  .48  .70  .62  –


Table 12. An Intercorrelation Matrix for the Tests (Grade 11)

                1    2    3    4    5    6
1. Cloze A      –
2. Cloze B      .53  –
3. Cloze C      .54  .53  –
4. RCT          .82  .84  .83  –
5. DST          .51  .53  .50  .62  –
6. Term Grades  .57  .54  .53  .66  .49  –

Table 13. An Intercorrelation Matrix for the Tests (Grade 12)

                1    2    3    4    5    6
1. Cloze A      –
2. Cloze B      .44  –
3. Cloze C      .36  .45  –
4. RCT          .73  .85  .76  –
5. DST          .45  .43  .31  .51  –
6. Term Grades  .54  .42  .41  .57  .54  –

Analyses of Reliability and Validity

In this subsection, the internal reliability coefficients (Cronbach’s alpha) for the DST and the RCT are presented. An analysis of validity is also reported.

Analyses of Reliability of the Discourse Structure Test and the Rational Cloze Test

Table 4 shows Cronbach’s alpha coefficients of the DST and the RCT. Considering the whole subject pool, the DST was more reliable than the RCT, with a slight difference of .04 (.84 – .80). In Manning’s (1986) study, the multiple-choice cloze test registered a reliability coefficient of .80. The coefficient obtained for the multiple-choice RCT in the present study may therefore be considered satisfactory. Tables 5 to 7 present Cronbach’s alpha coefficients of both tests at individual grade levels. For Grades 10 and 11, the reliability coefficient of the DST remained slightly higher than that of the RCT, with a difference of .03 in each case (.83 – .80 and .84 – .81, respectively). For Grade 12, however, a larger difference of .10 (.80 – .70) was identified.

To briefly sum up, concerning the data as a whole and at specific levels, Cronbach’s alpha coefficients were more consistent for the DST, ranging from .80 to .84, while a wider spread was observed for the RCT, ranging from .70 to .81. Administered to senior high school students of all grade levels, the DST could be considered a more stable, reliable test format.
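As a reminder of what the alpha coefficients above summarize, the following is a minimal sketch of Cronbach’s alpha for dichotomously scored items. The six response patterns are invented for illustration and are not the study’s data; the function name is likewise an assumption.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-person item-score lists.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(scores[0])
    item_variances = [variance([person[j] for person in scores]) for j in range(k)]
    total_variance = variance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 6 test-takers x 4 items (1 = correct, 0 = incorrect)
hypothetical_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

alpha = cronbach_alpha(hypothetical_scores)
print(f"alpha = {alpha:.2f}")
```

Alpha rises when the items vary together, so that the variance of the total scores is large relative to the sum of the item variances. This is consistent with the lowest RCT coefficient (.70) appearing in Grade 12, where the RCT total scores also had the smallest standard deviation (3.65).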

Analyses of Validity of the Rational Cloze Test

Evidence for the validity of the RCT was gathered in terms of content validity and concurrent validity. For content validity, a qualitative approach² was carried out by concerned parties in the field of language studies and in-service English teachers at the school where the present study was conducted. Concurrent, criterion-related validity was sought through the correlation coefficient between the DST and the RCT, both of which were hypothesized to tap the construct of cohesion.

Analysis of content validity. Through collective judgments from three graduate students (including the researcher) and three experienced teachers whose students participated in the study, a consensus was reached that the RCT measured the test-takers’ knowledge of cohesion in discourse. Based on the a priori discourse analysis, subcategories of the three major types of cohesion were sampled as proportionally as possible, so that the items would to some extent be representative of the hypothesized constructs of cohesion.

9The rationale for not using a quantitative approach (i.e., factor analysis) to the analysis of construct


Analysis of concurrent validity. Evidence for concurrent validity was supported by the intercorrelation between the DST and the RCT, inasmuch as both were hypothesized to measure knowledge of intersentential cohesion and were administered within a short period of time (i.e., within two weeks). In addition, because the DST has been an established, high-stakes test incorporated in the DRET, it may serve as a qualified external criterion. A moderate-high correlation of .66 was shown for the two gap-filling tests (see Table 10). Therefore, concurrent validity of the RCT as an exploratory test format could be claimed.

To briefly sum up, the validity of the RCT could be maintained to a satisfactory extent based on analyses of content validity and concurrent validity. However, more evidence for construct validity will be needed before the RCT can be established in cloze research as a measure of cohesion.

Test Equating

The equating of the DST and the RCT was performed by classical equating methods. The methods were applied to examine the equivalence of different test forms in means, variances, and inter-form covariance. The results are presented as follows.

Testing the Equivalence of Means

As a whole, the mean of the RCT was slightly higher than that of the DST, with a difference of .55 (21.09 – 20.54). A paired-sample t-test was used to test the statistical significance of the difference. The result indicated that a difference of .55, though small, remained statistically significant with the alpha level preset at p < .05 (t = 2.43, df = 353, p = .02). Therefore, in terms of means, the DST and the RCT were not equivalent test forms.
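The paired-sample t-ratio can be approximated from the summary statistics alone, because the variance of the score differences follows from the two standard deviations and the inter-test correlation. The sketch below uses the rounded values from Table 4 and Table 10, so it is only expected to land close to the reported t; the function name is an illustrative assumption.

```python
import math


def paired_t_from_summary(mean_x, mean_y, sd_x, sd_y, r, n):
    """Paired-sample t computed from means, SDs, correlation and sample size.

    Var(X - Y) = SD_x^2 + SD_y^2 - 2 * r * SD_x * SD_y
    """
    var_diff = sd_x ** 2 + sd_y ** 2 - 2 * r * sd_x * sd_y
    standard_error = math.sqrt(var_diff / n)
    return (mean_x - mean_y) / standard_error, n - 1


# Whole sample: RCT (x) versus DST (y), rounded values from Tables 4 and 10
t_ratio, df = paired_t_from_summary(21.09, 20.54, 4.84, 5.45, 0.66, 354)
print(f"t = {t_ratio:.2f}, df = {df}")
```

Because the two tests correlate at .66, the differences vary much less than either test does on its own, which is why a mean difference as small as .55 can still reach significance with 354 participants.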

While the global pattern showed that the two tests were not equivalent, local differences were observed. For individual grade levels, the mean scores of the DST and the RCT were fairly close. In Grade 10 and Grade 12, the means of the two tests were still shown to be significantly different with α preset at .05, t(123) = 2.47 and t(114) = 2.73, respectively. However, an interesting pattern was identified in Grade 11. The difference in means was shown to be statistically non-significant, t(114) = -.76. In other words, at this level the DST and the RCT were found to be equivalent in terms of means.

Testing the Equivalence of Variances

In the second phase of the classical equating methods, the equivalence of variances was examined. As can be seen in Table 14, for the whole sample the difference in variances between the DST and the RCT is 6.28 (29.69 – 23.41). The homogeneity of variances was examined by the F-max test based on the equation:

\[ t = \frac{s_1^2 - s_2^2}{\sqrt{\dfrac{4 s_1^2 s_2^2 \,(1 - r^2)}{n - 2}}}, \qquad df = n - 2 \]

where s₁² and s₂² refer to the sample variances and r² refers to the square of the sample correlation coefficient. With the alpha level preset at p < .05, the difference in variances was found to be statistically significant. The two-tailed t-ratio of -2.98 (df = 352) exceeded the critical t-ratio of -1.96 in absolute value, rejecting the null hypothesis that there was no statistically significant difference in variances.
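The t-ratios for the equivalence of variances can be recomputed from the variances, correlations, and sample sizes reported in Table 14. The sketch below assumes the usual form of the t-test for two correlated variances, with the RCT variance as the first term; the function and variable names are illustrative.

```python
import math


def correlated_variance_t(var_1, var_2, r, n):
    """t-test for the difference between two correlated variances (df = n - 2)."""
    standard_error = math.sqrt(4 * var_1 * var_2 * (1 - r ** 2) / (n - 2))
    return (var_1 - var_2) / standard_error


# (label, RCT variance, DST variance, r, n, reported t) from Table 14;
# n per grade comes from Tables 4 to 7
table_14 = [
    ("Grade 10", 22.95, 30.46, 0.72, 124, -2.26),
    ("Grade 11", 25.23, 29.15, 0.62, 115, -0.98),
    ("Grade 12", 13.36, 22.05, 0.51, 115, -3.12),
    ("Grades 10-12", 23.41, 29.69, 0.66, 354, -2.98),
]

for label, var_rct, var_dst, r, n, reported in table_14:
    t_ratio = correlated_variance_t(var_rct, var_dst, r, n)
    verdict = "significant" if abs(t_ratio) > 1.96 else "non-significant"
    print(f"{label}: t = {t_ratio:.2f} (reported {reported}), {verdict}")
```

The negative sign simply reflects that the RCT variance is subtracted from the larger DST variance; the decision rests on the absolute value of t.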

However, an interesting pattern was again identified in Grade 11. As shown in Table 14, significant differences in variances between the two tests were confirmed by the F-max tests at every grade level except Grade 11. For Grade 11, by contrast, the difference in variances was statistically non-significant, with the t-ratio of -0.98 falling within the critical t-ratio of ±1.96.

Table 14. Testing the Equivalence of Variance

              Variance (RCT)  Variance (DST)  r    t
Grade 10      22.95           30.46           .72  -2.26*
Grade 11      25.23           29.15           .62  -0.98
Grade 12      13.36           22.05           .51  -3.12*
Grades 10–12  23.41           29.69           .66  -2.98*

Note. t-distribution is significant at *p < .05

Testing the Equivalence of Inter-Form Covariance

After the examination of equivalence in means and variances, the results of the analyses of inter-form covariance, the final stage of the classical equating methods, are presented in this subsection.

First of all, a significant correlation between the DST and the RCT required confirmation. As previously shown in Table 10, the correlation coefficient between the tests was found to be .66 for Grades 10–12, which was statistically significant at p < .001. Correlations of the DST and the RCT with the term grades were found to be statistically significant at .45 and .53, respectively. Only when the significance of the correlations was confirmed could the equating of inter-form covariance be legitimately performed.

Equivalence in inter-form covariance was examined through the difference between the correlation of the RCT with the term grades (r = .53) and the correlation of the DST with the term grades (r = .45). Hotelling’s (1940) t-test was utilized, written in the following equation, with x, y, and z denoting the RCT, the DST, and the term grades (see the notes to Table 15):

\[ t = (r_{xz} - r_{yz}) \sqrt{\frac{(n - 3)(1 + r_{xy})}{2\,(1 - r_{xy}^2 - r_{xz}^2 - r_{yz}^2 + 2\, r_{xy}\, r_{xz}\, r_{yz})}}, \qquad df = n - 3 \]

A difference of .08 (.53 – .45) was found to be significant at p < .05, with a t-ratio of -2.15. Considering the whole subject pool, the result suggests that the DST and the RCT were not statistically equivalent tests, inasmuch as equivalence in inter-form covariance failed to be satisfied.
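Hotelling’s t for two dependent correlations can be sketched from the three correlations and the sample size. The code below assumes the standard form of the statistic and uses the two-decimal correlations from Table 15, so the results only approximate the reported t-ratios. The sign depends on the order of subtraction (the reported values are negative), which is why the comparison is made on absolute values. Function and variable names are illustrative.

```python
import math


def hotelling_t(r_xy, r_xz, r_yz, n):
    """Hotelling's t for comparing r_xz with r_yz in one sample (df = n - 3)."""
    determinant = (1 - r_xy ** 2 - r_xz ** 2 - r_yz ** 2
                   + 2 * r_xy * r_xz * r_yz)
    return (r_xz - r_yz) * math.sqrt((n - 3) * (1 + r_xy) / (2 * determinant))


# (label, r_xy, r_xz, r_yz, n, reported |t|): x = RCT, y = DST, z = term grades
table_15 = [
    ("Grade 10", 0.72, 0.70, 0.62, 124, 1.75),
    ("Grade 11", 0.62, 0.66, 0.49, 115, 2.76),
    ("Grade 12", 0.51, 0.57, 0.54, 115, 0.36),
    ("Grades 10-12", 0.66, 0.53, 0.45, 354, 2.15),
]

decisions = {}
for label, r_xy, r_xz, r_yz, n, reported in table_15:
    t_ratio = hotelling_t(r_xy, r_xz, r_yz, n)
    decisions[label] = abs(t_ratio) > 1.96
    print(f"{label}: |t| = {abs(t_ratio):.2f} (reported {reported})")
```

With rounded inputs the magnitudes differ slightly from the reported ones, but the pattern of decisions is the same: significant for Grade 11 and for Grades 10–12, non-significant for Grade 10 and Grade 12.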

The inequivalence of the two tests in inter-form covariance may also be examined solely in terms of the correlation coefficient. That is, the correlation between the DST and the RCT may shed some light on the likelihood of equivalence in inter-form covariance without incorporating the term grades into the computation. In Beglar and Hunt’s (1999) study, two revised test forms of the 2000 Word Level Test (Nation, 1990) were shown to be equivalent by classical equating methods. These two forms correlated highly at r = .90. In the same study, two revised test forms of the University Word Level Test (Nation, 1990) were also found to be equivalent, correlating highly at r = .84. Compared to such high correlations, the coefficient of .66 for the DST and the RCT was not as high. Seen from this perspective, the failure to establish equivalence in inter-form covariance could be explained.

Although equivalence of inter-form covariance was shown to be absent for the whole subject pool, some local patterns were observed in Grade 10 and Grade 12. Table 15 contains the individual correlation coefficients between the DST, the RCT, and the term grades at each grade level. While significant differences in inter-form covariance remained for Grade 11 and Grades 10–12, non-significance was confirmed for Grade 10 and Grade 12, as the t-ratios lay within the critical t of ±1.96. Statistically speaking, for Grade 10 and Grade 12 the DST and the RCT were equivalent in terms of inter-form covariance.

Table 15. Testing the Equivalence of Inter-Form Covariance

              rxy  rxz  ryz  Hotelling’s t
Grade 10      .72  .70  .62  -1.75
Grade 11      .62  .66  .49  -2.76*
Grade 12      .51  .57  .54  -0.36
Grades 10–12  .66  .53  .45  -2.15*

Note.
1. t-distribution is significant at *p < .05
2. rxy represents the correlation coefficient of the RCT (x) and the DST (y).
3. rxz represents the correlation coefficient of the RCT (x) and the Term Grade (z).
4. ryz represents the correlation coefficient of the DST (y) and the Term Grade (z).
5. Correlation is significant at the .01 level (2-tailed).

Regression Analyses

In this subsection, the results of the regression analyses are presented in the following order: a) linearity of the regression model; b) a simple linear regression analysis; and c) a linear multiple regression analysis. The extent to which the RCT as a whole and its major subtests, i.e., Cloze A (reference), Cloze B (lexical cohesion) and Cloze C (conjunction), contributed to the prediction of the DST is examined in detail.

Linearity of a Regression Model

To confirm the linearity of a regression model, patterns of regression and residuals were first examined with a scatterplot and a Probability-Probability (P-P) plot. Figure 1 shows a scatterplot with the RCT on the X axis and the DST on the Y axis. As the plot shows, a positive relationship between the two tests can be seen, and under the least squares principle a linear regression line can be visualized as the best-fitting model for the regression of the DST on the RCT.

Figure 1. A Scatterplot for the RCT and the DST (Grades 10 – 12)

After the initial examination of the regression of the DST, the distribution of residuals was further examined by means of a P-P plot. The P-P plot can be used to test the assumption of normality by comparing the sampling distribution of residuals against the hypothesized, expected normal distribution (Cook & Weisberg, 1994, p. 209). First, the observed residuals were ranked from the smallest to the largest as rᵢ (i = 1, 2, 3, …, n). The residuals were then standardized and transformed into probability values. The ordinal values of the rᵢ’s were also converted into probability values as percentile ranks (p), based on the assumption of normal distribution. With slight differences, a variety of functions for p have been proposed as follows (Blom, 1958; Bowerman & O’Connel, 1990; Montgomery & Peck, 1982):

\[ p_i = \frac{i - 3/8}{n + 1/4} \quad \text{or} \quad p_i = \frac{i - 1/3}{n + 1/3} \quad \text{or} \quad p_i = \frac{i - 1/2}{n} \]

If the observed standardized residuals match the hypothesized residuals, a perfect normal distribution can be conceptualized as a 45-degree straight line, along which the observed and the expected residuals fall. As shown in Figure 2, the standardized residuals of the DST on the RCT generally cluster closely around the perfect diagonal line, a 45-degree straight line with a slope of 1. In line with the distribution of residuals, “a distinct straight-line appearance” was approximately validated (Bowerman & O’Connel, 1990, p. 247). This suggests that a linear regression model, instead of a curvilinear one, could be claimed as the best-fitting equation accounting for the RCT as a predictor and the DST as a criterion.
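The construction of a normal P-P plot described above can be sketched as follows. The residuals are invented for illustration, and the three plotting-position conventions are written in the general form (i − a)/(n + b); their labels and constants are assumptions based on commonly cited variants, not values taken from the study’s software output.

```python
from statistics import NormalDist, mean, stdev

# (a, b) constants in p_i = (i - a) / (n + b); labels are informal assumptions
CONVENTIONS = {
    "blom": (3 / 8, 1 / 4),
    "one_third": (1 / 3, 1 / 3),
    "midpoint": (1 / 2, 0.0),
}


def plotting_positions(n, convention="blom"):
    """Cumulative probabilities assigned to ranks 1..n."""
    a, b = CONVENTIONS[convention]
    return [(i - a) / (n + b) for i in range(1, n + 1)]


def pp_points(residuals, convention="blom"):
    """Pairs (rank-based probability, normal probability of the residual)."""
    ordered = sorted(residuals)
    centre, spread = mean(ordered), stdev(ordered)
    normal_probs = [NormalDist().cdf((r - centre) / spread) for r in ordered]
    rank_probs = plotting_positions(len(ordered), convention)
    return list(zip(rank_probs, normal_probs))


# Hypothetical residuals from a fitted regression line
hypothetical_residuals = [-4.1, -2.3, -1.0, -0.4, 0.2, 0.9, 1.5, 2.2, 3.2]
points = pp_points(hypothetical_residuals)
for rank_prob, normal_prob in points:
    print(f"{rank_prob:.3f}  {normal_prob:.3f}")
```

If the residuals are approximately normal, the two probabilities in each pair are close to each other, so the points lie near the 45-degree diagonal.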

Figure 2. A Normal P-P Plot

Simple Linear Regression Analysis

Holding the RCT as the sole predictor, a simple linear regression analysis was first computed for the constant and the regression coefficient. As shown in Table 16, a simple regression equation can be derived and written as Y = 4.84 + .74X, where Y represents the predicted DST score and X refers to the RCT score. For instance, if a student has a score of 25 on the RCT, his or her DST score is predicted to be around 23 (4.84 + .74 × 25 = 23.34). The equation can also be expressed in terms of the standardized beta coefficient or “beta weight” (Younger, 1985, p. 164): Y* = .66X. This means that an increase of one standard deviation in X (i.e., the RCT) is associated with a predicted increase of .66 standard deviation in Y (i.e., the DST). The simple regression equation, expressed in either raw or standardized scores, serves as the best-fitting line under the method of least squares, which minimizes the residual sum of squares.

Table 16. Regression Coefficients for the RCT as a Predictor (Grades 10–12)

Model     Unstandardized    Standardized         t      Sig.
          Coefficients (B)  Coefficients (Beta)
Constant  4.84                                   4.96   .000*
RCT       .74               .66                  16.53  .000*
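In simple regression, the slope and the constant follow directly from the descriptive statistics: the slope is r multiplied by the ratio of the two standard deviations, and the constant places the line through the two means. The sketch below uses the rounded values from Table 4 and Table 10, so it reproduces the reported coefficients only approximately; the names are illustrative.

```python
def regression_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Least-squares constant and slope for predicting Y from X."""
    slope = r * sd_y / sd_x
    constant = mean_y - slope * mean_x
    return constant, slope


# X = RCT, Y = DST, whole sample (Tables 4 and 10)
constant, slope = regression_from_summary(21.09, 4.84, 20.54, 5.45, 0.66)
print(f"Y = {constant:.2f} + {slope:.2f}X")

# Worked example from the text, using the reported coefficients
predicted_dst = 4.84 + 0.74 * 25
print(f"Predicted DST for an RCT score of 25: {predicted_dst:.2f}")
```

The standardized coefficient in a one-predictor model equals the correlation itself, which is why the beta weight in Table 16 matches the RCT and DST correlation of .66 in Table 10.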

Linear Multiple Regression Analysis

After considering the RCT as a sole predictor, the patterns of Cloze A, B, and C in predicting performance on the DST were analyzed. Because the three were entered as individual predictors, a multiple-predictor model was applied instead of a simple one-predictor regression model. Table 17 shows the results. Adopting the standardized beta coefficients (β), the equation can be written as Y = .23X₁ + .36X₂ + .22X₃, where Y represents the standardized score of the DST and X₁, X₂, and X₃ refer to the standardized scores of Cloze A, B, and C, respectively. With the other predictors held constant, Cloze B (lexical cohesion) was loaded with the highest beta weight (β = .36), compared with Cloze A (β = .23) and Cloze C (β = .22). In other words, of the three predictors, Cloze B was the most explanatory variable, exerting the strongest predictive power and influence on the regression of the DST.

Table 17. Regression Coefficients for the RCT Subtests as Predictors (Grades 10–12)

Variable  Unstandardized    Standardized         t     Sig.
          Coefficients (B)  Coefficients (Beta)
Constant  5.12
Cloze A   .75               .23                  4.85  .000*
Cloze B   .85               .36                  6.94  .000*
Cloze C   .61               .22                  4.41  .000*
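The standardized coefficients in Table 17 can be approximated from the correlations in Table 10 by solving the normal equations in correlation form, R β = r, where R holds the correlations among the three subtests and r their correlations with the DST. The sketch below is an illustration under that assumption; it uses the two-decimal correlations, so small rounding differences from the reported betas are expected. The solver is a generic Gauss-Jordan routine written for this example.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


# Correlations among Cloze A, B, C (Table 10, Grades 10-12)
predictor_correlations = [
    [1.00, 0.51, 0.44],
    [0.51, 1.00, 0.55],
    [0.44, 0.55, 1.00],
]
# Correlations of Cloze A, B, C with the DST (Table 10)
criterion_correlations = [0.51, 0.59, 0.51]

betas = solve_linear_system(predictor_correlations, criterion_correlations)
r_squared = sum(b * r for b, r in zip(betas, criterion_correlations))
for name, beta in zip(("Cloze A", "Cloze B", "Cloze C"), betas):
    print(f"{name}: beta = {beta:.2f}")
print(f"R squared = {r_squared:.2f}")
```

The same computation shows why Cloze B carries the largest weight: it has the highest correlation with the DST (.59), and that advantage is not fully absorbed by its overlap with the other two subtests.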

After the computation of beta weights for each subtest, a linear multiple regression analysis was performed to obtain R-square (R²), the coefficient of determination. Table 18 exhibits the results using the Enter method by simultaneously submitting the predictors to the model. As shown in the table, Cloze C, the predictor loaded with the lowest beta coefficient, was first fitted into the model, followed by Cloze A, the predictor loaded with the second highest beta coefficient. Cloze B was the last to be added. Based on the adjusted R², it was shown that .26 of the shared variability in the performance on the DST was accounted for by Cloze C alone. To increase the accuracy of prediction, Cloze A was then added to the first model. It is shown in the same table that adding Cloze A resulted in an increase of R² by .10 (.36 – .26), which was statistically significant according to F change. When Cloze B was finally entered, the aggregate adjusted R² amounted to .43, registering an increase of .07 (.43 – .36). That is, the predictive power of Cloze B could be construed as contributing .07 of the explained variance over and above Cloze C and Cloze A combined. The adjusted aggregate R² of .43 indicated that the three predictors accounted for 43 percent of the shared variance of observed performance on the DST.

Table 18. A Linear Multiple Regression Analysis (Grades 10–12)

Model  Predictors                 R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze C                    .51  .26  .26          125.59    .000*
2      Cloze C, Cloze A           .60  .36  .36          53.96     .000*
3      Cloze C, Cloze A, Cloze B  .66  .44  .43          48.19     .000*

Note. *F change is significant at *p < .05

To further examine the unique predictive power of Cloze B against Cloze A and C, the hierarchical regression method was utilized to build another five models. This method involved a stepwise procedure that manipulated a designated order of predictors (see Qian, 1999, 2002). The changes in the magnitude of the coefficient of determination (R²) indicated the predictive power of a given predictor over and above any combination of the other previously entered predictors, which served the purpose of this phase. As shown in Tables 19 to 23, Cloze A and Cloze C each contributed .03 (.43 – .40) of the shared variance over and above the combination of the first two predictors. An additional variance of .03 was smaller than the .07 contributed by Cloze B. Therefore, Cloze B was confirmed to have the strongest influence on the prediction of performance on the DST. Moreover, the results of the hierarchical regression analyses indicated that, despite showing similar beta coefficients, Cloze A and Cloze C still served as valid predictors and functioned over and above any combination of the other predictors. Such results may justify the use of the hierarchical approach.

Table 19. A Hierarchical Regression Analysis (Order: Cloze C–B–A)

Model  Predictors                 R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze C                    .51  .26  .26          125.59    .000*
2      Cloze C, Cloze B           .63  .40  .40          80.75     .000*
3      Cloze C, Cloze B, Cloze A  .66  .44  .43          23.48     .000*

Table 20. A Hierarchical Regression Analysis (Order: Cloze B–C–A)

Model  Predictors                 R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze B                    .59  .35  .35          188.85    .000*
2      Cloze B, Cloze C           .63  .40  .40          30.24     .000*
3      Cloze B, Cloze C, Cloze A  .66  .44  .43          23.48     .000*


Table 21. A Hierarchical Regression Analysis (Order: Cloze B–A–C)

Model  Predictors                 R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze B                    .59  .35  .35          188.85    .000*
2      Cloze B, Cloze A           .64  .41  .40          34.43     .000*
3      Cloze B, Cloze A, Cloze C  .66  .44  .43          19.43     .000*

Table 22. A Hierarchical Regression Analysis (Order: Cloze A–B–C)

Model  Predictors                 R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze A                    .51  .26  .26          122.60    .000*
2      Cloze A, Cloze B           .64  .41  .40          88.23     .000*
3      Cloze A, Cloze B, Cloze C  .66  .44  .43          19.43     .000*

Table 23. A Hierarchical Regression Analysis (Order: Cloze A–C–B)

Model  Predictors                 R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze A                    .51  .26  .26          122.60    .000*
2      Cloze A, Cloze C           .60  .36  .36          56.51     .000*
3      Cloze A, Cloze C, Cloze B  .66  .44  .43          48.19     .000*


As shown in Table 20 and Table 21, Cloze B, the most influential predictor, was fitted first into the regression models, followed by the other less powerful predictors. This approach might trigger “a problem of fitting” predictors, as Weisberg (1985, p. 51) maintains. What if the order had been reshuffled by fitting Cloze C first, followed by Cloze A and B accordingly? As Weisberg (1985) indicates, “In multiple regression, if the predictors are correlated the sign of a coefficient may change depending on the other predictors in the model” (p. 65). When the first predictor has been entered, the second will be adjusted, and so will the third. This makes sense because the three correlated significantly at .51, .44 and .55 for Cloze A and Cloze B, Cloze A and Cloze C, and Cloze B and Cloze C, respectively. Since the three predictors were fitted altogether as an aggregate in the final phase, R² would remain the same no matter which predictor was entered first.
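The order effect described here can be illustrated with the correlations in Table 10: the final R² is identical for every entry order, while the increment credited to each predictor depends on when it is entered. The sketch below assumes the determinant identity R² = 1 − |R(y, predictors)| / |R(predictors)| and uses the rounded correlations, so the values only approximate those in Tables 18 to 23. All names are illustrative.

```python
from itertools import permutations

# Rounded correlations from Table 10 (A, B, C = Cloze A, B, C; Y = DST)
CORRELATIONS = {
    frozenset("AB"): 0.51, frozenset("AC"): 0.44, frozenset("BC"): 0.55,
    frozenset("AY"): 0.51, frozenset("BY"): 0.59, frozenset("CY"): 0.51,
}


def corr(p, q):
    return 1.0 if p == q else CORRELATIONS[frozenset((p, q))]


def determinant(matrix):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(matrix) == 1:
        return matrix[0][0]
    return sum(
        (-1) ** j * matrix[0][j]
        * determinant([row[:j] + row[j + 1:] for row in matrix[1:]])
        for j in range(len(matrix))
    )


def r_squared(predictors):
    """R squared for predicting the DST (Y) from the given predictors."""
    with_criterion = ["Y"] + list(predictors)
    full = [[corr(p, q) for q in with_criterion] for p in with_criterion]
    inner = [[corr(p, q) for q in predictors] for p in predictors]
    return 1 - determinant(full) / determinant(inner)


final_values = []
gain_when_last = {}
for order in permutations("ABC"):
    steps = [r_squared(order[:k]) for k in (1, 2, 3)]
    final_values.append(steps[2])
    gain_when_last[order[2]] = steps[2] - steps[1]
    print("-".join(order), " ".join(f"{s:.2f}" for s in steps))
```

In this sketch, the gain from entering Cloze B last is larger than the gain from entering Cloze A or Cloze C last, which mirrors the pattern reported in Tables 18 to 23.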

Entering the best predictor first was intended to test the redundancy of adding another predictor. If entering Cloze A had not significantly increased the accuracy of prediction and “added more relevant unique information” (Glass & Hopkins, 1996, p. 176), using Cloze B alone would have sufficed for the model. This justified the utility of a linear multiple regression model using three predictors in lieu of a simple model with Cloze B as the sole predictor.

In addition to the regression analyses for the whole data set, models for individual grades were also constructed. As Tables 24 to 26 show, in Grades 10 and 11 all three predictors significantly explained and predicted the shared variance in the DST. In contrast, in Grade 12 Cloze C failed to contribute significant explanatory power to the model. In other words, for the Grade 12 data, Cloze A and Cloze B alone would suffice for the prediction of performance on the DST.

Table 24. A Linear Multiple Regression Analysis (Grade 10)

Model  Predictors                 β    R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze B                    .43  .65  .42  .42          89.43     .000*
2      Cloze B, Cloze C           .25  .69  .48  .47          13.51     .000*
3      Cloze B, Cloze C, Cloze A  .22  .72  .52  .51          9.44      .003*

Note. *F change significant at *p < .05

Table 25. A Linear Multiple Regression Analysis (Grade 11)

Model  Predictors                 β    R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze B                    .28  .53  .28  .27          43.74     .000*
2      Cloze B, Cloze A           .25  .60  .36  .35          13.40     .000*
3      Cloze B, Cloze A, Cloze C  .22  .62  .39  .37          5.61      .020*

Table 26. A Linear Multiple Regression Analysis (Grade 12)

Model  Predictors                 β    R    R²   Adjusted R²  F Change  Sig. F Change
1      Cloze A                    .30  .45  .20  .19          28.02     .000*
2      Cloze A, Cloze B           .26  .52  .27  .25          10.51     .002*
3      Cloze A, Cloze B, Cloze C  .09  .52  .27  .25          .87       .352


Summary

At the end of the first section, the research questions can be answered. The first research question asked whether the RCT and the DST were equivalent tests. The results suggest that the two measures were statistically different in terms of the three criteria of the classical equating methods. However, it should be noted that the differences were small and that local equivalence was identified within specific grade levels. The second research question inquired how the three RCT subtests predicted performance on the DST. Regarding the whole data set, the results show that Cloze B (lexical cohesion) functioned as the strongest predictor, followed by Cloze A (reference) and Cloze C (conjunction) in that order.

Discussion

In the previous section, results of the three-phase analyses have shed some light on the essential properties of the DST and the RCT. In this section, research findings are addressed in detail. In the first subsection, the focus is on how the reliability of the RCT could be established and maintained according to feasible, underlying factors that may contribute to the satisfactory reliability coefficient of the RCT (.80) vis-à-vis that of the DST (.84). In the second subsection, the core of discussion tackles possible arguments for the construction of validity of the RCT. In the third subsection, the main concern rests on potential factors that may account for global inequivalence. In the last subsection, the emphasis is placed on the regression analyses of the RCT as a sole predictor and the predictive power of its three subtests.

The Establishment of Reliability of the Rational Cloze Test

As previously shown in Table 4, while the alpha coefficient for the DST reached .84, the coefficient for the RCT registered .80. Such satisfactory results for the RCT as a reliable test may be accounted for first by the a priori, meticulous control over textual features, multiple-choice option design and test administration.

Text Type

Two out of a total of three RCT texts were expository, while three out of six DST texts belonged to the same text type. The ratio of expository materials against the remaining text types for both tests was equally 2:3 (67%). In terms of genre familiarity, this may lend support to reliability. Expository materials constitute a significant part of high school English textbooks. When the testees can be assumed to be familiar with this genre, for example, due to years of repeated exposure to reading tasks, their test responses may be more reliably elicited and the reliability of a given test may be maintained.

Readability Level

The Flesch–Kincaid indexes showed that the obtained grade levels ranged between 8.1 and 10.8 for the RCT subtexts (with an average of 9.3), which fell within the grade levels of readability for the DST subtexts (between 6.1 and 12, with an average of 8.6). Although the distributions of readability were not controlled to be absolutely identical for the texts of both tests, readability of the RCT texts per se may still to some extent be aligned to an appropriate, readable level with the DST serving as an anchor. In other words, when text materials were controlled for appropriateness of readability, the resulting tests were doable and cases of random guessing or some reliability-breaching factors might thus be attenuated.

Lexical Frequency

Besides a holistic assessment of textual dimensions, the control of vocabulary as the basic building blocks may provide additional evidence for reliability. As previously presented, an analysis of the Lexical Frequency Profile (LFP) indicated that, for vocabulary levels, 87 percent of words in the RCT texts came from the GSL (i.e., K1 and K2 combined), and that 90.44 percent could be covered by the GSL and the AWL combined. In a similar fashion, for the DST texts, 87.34 percent belonged to the GSL alone, and 91.62 percent could be covered by the GSL and the AWL combined. Despite the higher percentage of Off-List words in the RCT texts (9.56%), the LFP measures showed that the breadth of vocabulary use was considerably similar for both tests. Had the overall frequency counts been starkly different, the alpha coefficients might have been less similar, as research suggests that differences in textual characteristics, e.g., difficulty level, may influence test performance (Alderson & Urquhart, 1985; Jonz, 1991; Klein-Braley, 1997; Sasaki, 2000; Shohamy, 1988; Swain, 1993).

Control of the Multiple-Choice Options

On top of controlling textual features, reliability could be addressed in light of the design of options in the multiple-choice format. Tables 27 and 28 present the vocabulary frequency of the multiple-choice options. The choices in Cloze A were all GSL and Level 1 words. In Cloze B, most of the four options for each item still belonged to the GSL or to Level 2 to Level 4 words. As for Cloze C, most options were still sampled from the GSL or Level 1 words. Except for the Cloze B choices, which spanned more frequency bands, the majority of the choices could be regarded as basic, frequent words. When the item choices per se did not pose overwhelming difficulty, test response could be argued to be more reliably elicited, insomuch as the testees need not resort to wild guessing, random selection or strategic elimination; instead, they may be encouraged to rely on overall comprehension and to utilize contextual features and cohesive ties to identify the most appropriate target choices.

Table 27. An Analysis of Word List for Item Choices

Subtest    GSL      AWL      Off-List
Cloze A    100%     0%       0%
Cloze B    70%      15%      15%
Cloze C    82.5%    17.5%    0%

Table 28. An Analysis of Word Level for Item Choices

Subtest    Level 1   Level 2   Level 3   Level 4   Level 5   Level 6
Cloze A    100%      0%        0%        0%        0%
Cloze B    17.5%     40%       17.5%     20%       5%
Cloze C    62.5%     17.5%     12.5%     5%        2.5%

Besides the overall examination of the multiple-choice design, it could be noted that, for a given item, the word levels of its four choices were controlled to be as close as possible, though not uniformly so, especially in Cloze B, owing to its lexical properties (see Appendixes D and E). Keeping word levels close may to some extent prevent the options from giving away unnecessary clues, compared to cases in which the frequency levels of the choices are too distant from each other.

Test Administration

Processes of test administration may provide evidence for overall reliability. The researcher’s observation during test sessions may explain and support the results that the testees’ performance on both tests was reliable. Before giving the tests, the researcher had clarified the purpose and use of the study, with test formats (i.e., the DST and the three RCT subtests) clearly specified and demonstrated. The presence and monitoring of the researcher in each session would also encourage the participants’ cooperation and responsibility in test taking and may reduce the likelihood of any deliberately uncooperative demeanor (e.g., wild guessing or vandalistic behavior).

The physical environment was a typical high school classroom, with which the testees were already quite familiar. Without disturbance such as intolerable noise, the participants’ competence could be more reliably triggered, though not completely due to some unaccountable and unobservable factors. The allocation of time would also contribute to reliability. A typical classroom period of fifty minutes was sufficient for all the testees. On average, most participants completed the tests within or around forty minutes.

The Establishment of Validity of the Rational Cloze Test

As presented in the previous section, validity of the RCT could be maintained based on analyses of content validity and concurrent validity. For the RCT as a measure of cohesion, the rational approach to the a priori identification of cohesive ties may assume the most critical role in establishing test validity. With meticulous identification and classification of cohesive ties, followed by careful control and selection of target choices and distractors, the resulting RCT items may thus function satisfactorily in eliciting assumed, desired constructs of cohesion for successful closure.

However, research utilizing think-aloud protocols on the rational cloze also suggests that multiple information sources and complex test-taking processes are triggered in task completion (Sasaki, 2000; Storey, 1997; Yamashita, 2003). The test-takers may not necessarily solve an item following the specific routes previously expected by the researcher. For example, even though the lexical cohesive ties had been identified and the testees were supposed to utilize intersentential reading to select the target choice in Cloze B, they might also gain relevant clues intrasententially around the blank. This echoes what Halliday and Hasan (1976) describe as the subtlety and complexity of cohesive mechanisms, especially for collocational cohesion, one of the major item types in Cloze B.

Above all, despite the presence of unexpected test-taking processes, the aforementioned studies using think-aloud techniques to gain straightforward evidence for construct validity imply that most rational items can activate expected test responses and that the rational cloze can function as a recommended, sufficient measure of text-level knowledge. With such implications, the construct validity of the RCT may secure further support.

Global Inequivalence

The findings showed that, as a whole, the DST and the RCT failed to meet the rigid requirements of classical equating methods. Global inequivalence was confirmed in that the two tests differed statistically in means, variances and inter-form covariance. Such results could be expected, as Henning (1987) maintains that “it is almost impossible to satisfy conditions of equality of means, variances and correlations” (p. 81). In this subsection, factors contributing to global inequivalence are expounded based on possible differences in text type, text topic, text length and test response.

Text Type

The variety of genres may impose different degrees of difficulty on tasks involving reading comprehension. While the six DST subtexts included three different types, with expository texts constituting the majority, the three RCT subtexts involved only two types, with exposition remaining the majority. Based on schema-theoretical perspectives, each text type may require different levels of memory schemata and strategies that make the texts per se as readable and interpretable as possible (Bobrow & Norman, 1975). Arguably, the more the text types


finding that the DST was more difficult than the RCT, evidenced in statistically significant differences in means.

Text Topic

A considerable number of studies have suggested that readers’ background knowledge and familiarity with a given text topic correlate positively with reading comprehension (Asher, 1980; Barry & Lazarte, 1998; Carrell, 1987; Hudson, 1982; Jenkins & Dixon, 1983; Nagy, Anderson, & Herman, 1987; Paribakht & Wesche, 2000). The DST subtexts involved six topics, twice as many as those of the RCT subtexts. This may render the DST more difficult and challenging. When a testee is more familiar with various kinds of topic-specific knowledge, for example, through extensive reading and exposure to multiple genres over time, cumulative reading experience may enable the test-taker to exercise more sophisticated command when a test involves more topics.

Text Length

The length of the text materials may impose difficulties on the test-takers, with longer passages typically assumed to be more demanding and difficult to tackle. Nonetheless, although the RCT exceeded the DST in the total number of words (2008 vs. 1541), the longer texts appeared not so much to increase the overall difficulty of the RCT as to lower its general reading difficulty, by providing more contextual, redundant information that might facilitate intelligent, strategic guessing and inferencing processes. Therefore, text length may be a less critical variable than text type and topic familiarity in determining text difficulty for both gap-filling tests.

Test Response

Differences in test-response patterns may activate certain test-taking strategies, which may contribute to global inequivalence. One such strategy is “test-wiseness,” defined as “a subject’s capacity to utilize test characteristics and formats of the test-taking situations to receive a higher score” (Millman, Bishop, & Ebel, 1965, p. 707). Elimination may be resorted to in order to facilitate the gap-filling process (Alderson, Clapham, & Wall, 1995; Storey, 1997). For each of the six DST subtests, once one of the five items has been solved, choosing among the remaining alternatives to restore the remaining items appears to be relatively easier. From this perspective, the item-restoration process appears to be more interdependent for the DST. In contrast, despite the presence of similar test-taking mechanisms at work, items in the RCT may require slightly different test-response patterns. Each item is provided with an independent set of four choices, and whether a specific item is solved correctly may not decisively determine the next item, which is given another set of alternatives. The item-restoration process in the RCT, therefore, appears to be less interdependent.
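The contrast in interdependence can be illustrated with chance-level guessing probabilities. The sketch below assumes, purely for illustration, that a DST subtest draws on a shared pool of options, each usable once, and that the pool holds as many options as there are items (five); the actual number of options per DST subtest is not stated in this section, so `pool_size` is a hypothetical parameter, as are both function names.

```python
from fractions import Fraction


def shared_pool_chance(pool_size, items_already_solved):
    """Chance of guessing the next item when solved options leave a shared pool."""
    remaining = pool_size - items_already_solved
    if remaining <= 0:
        raise ValueError("no options remain in the pool")
    return Fraction(1, remaining)


def independent_chance(options_per_item):
    """Chance of guessing an item that has its own separate option set."""
    return Fraction(1, options_per_item)


# Assumed pool of five options for five items (hypothetical DST-like subtest).
dst_like = [shared_pool_chance(5, solved) for solved in range(5)]
# Four independent options per item, as described for the RCT.
rct_like = [independent_chance(4) for _ in range(5)]
print([str(p) for p in dst_like])  # ['1/5', '1/4', '1/3', '1/2', '1']
print([str(p) for p in rct_like])  # ['1/4', '1/4', '1/4', '1/4', '1/4']
```

Under these assumptions, each solved item in the shared-pool format raises the chance level of the remaining items, whereas the chance level in the independent format stays constant.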

Another factor influencing test response may be differences in information processing. The RCT items were blanks in the form of either a single word or a phrasal unit, while the DST items were presented in the form of a complete sentence. It has been suggested that multiple information sources are required for gap-filling tests, especially for items at the intersentential level (Abraham & Chapelle, 1992; Sasaki, 2000; Yamashita, 2003). Although items in both tests require intersentential reading, sentence processing in the DST may not only require knowledge of cohesion, but also demand comprehension of overall textual coherence for successful closure. This may account for the finding that the tests were globally inequivalent, with the sentence-processing-oriented DST slightly more difficult and cognitively more demanding than the word-or-phrase-processing-based RCT. Although cohesion can be argued to be at work in restoring missing elements for both tests, such differences between word/phrase processing and sentence processing may be more decisive and


Regression of the Discourse Structure Test on the Rational Cloze Test and the Subtests

In this subsection, based on the hypothesized, shared construct of intersentential cohesion, the relationship between the DST and the RCT is discussed and examined in terms of the coefficient of determination. In the last subsection, the significance of the three RCT subtests in predicting and explaining performance on the DST is expounded according to past research.

The General Predictive Power of the RCT

As previously shown, 43 percent of the variance in the DST as a criterion could be accounted for by the RCT as a predictor. For knowledge of cohesion alone to predict performance on the DST, the R² of .43 could be regarded as satisfactory, because the residual, unexplained proportion of variance of .57 (1 – R²) could be explained by other factors.
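The relation between R² and the unexplained proportion can be made explicit with a short sketch. The helper names below are hypothetical; only the value .43 comes from the present results, and the function computing R² from scores is a generic definition, not a reproduction of the statistical software used in the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 minus residual SS over total SS."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


def unexplained_proportion(r2):
    """Proportion of criterion variance not accounted for by the predictor(s)."""
    return 1 - r2


# R-squared reported for the regression of the DST on the RCT.
print(round(unexplained_proportion(0.43), 2))  # 0.57
```

The remaining .57 is thus the share of DST variance left to predictors other than the RCT, together with measurement error.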

Some coherence-related dimensions could be latent variables which were not sampled and tested in the RCT. As Celce-Murcia and Olshtain (2000) claim, knowledge of coherence is a global, top-down approach to text organization:

Coherence contributes to the unity of a piece of discourse such that the individual sentences or utterances hang together and relate to each other. This unity and relatedness is partially a result of a recognizable organizational pattern for the proposition and ideas in the passage, but it also depends on the presence of linguistic devices that strengthen global unity and create local connectedness. (p. 8)

The RCT may assess cohesion as a local mechanism as to how missing elements can be restored via cohesive chains stretching over neighboring sentences. However, knowledge of coherence as a global mechanism may operate throughout the whole discourse and thus may not be directly measured and reflected in the RCT.

Another coherence-oriented feature not involved in the RCT is the notion of thesis statements. An awareness of the functional aspect of the thesis statement constituting “macrostructure” may contribute to overall coherence and facilitate reading comprehension (Lee, 1998). Take Item 16 and Item 22 in the DST for example. They were expected to measure whether the testees were able to judge which sentence served as the best thesis statement:

Starting around 4,000 B.C., traditional Chinese brush painting has developed continuously over a period of more than six thousand years. __16__ (DST D)

Many researchers have been interested in whether or not an individual’s birth order has an effect on intelligence. One of the first studies was carried out in the Netherlands during the early 1970s. __21__ The test was called the “Raven,” which is similar to the I.Q. test. The researchers found a strong relationship between the birth order of the test takers and their scores on the Raven test. __22__ (DST E)

Still another critical feature not tested in the RCT is knowledge about topic sentences. As shown below, Items 9, 12, 17 and 28 in the DST were intended to measure the knowledge of how topic sentences functioned to unfold the texture of discourse:

__9__ Even though her friends teased her about her awkward invention attached to a streetcar, Mary didn’t give in to peer pressure. (DST B)

The first two months were encouraging. __12__ When I cooked dinner, he would take a walk with our daughter; a few times, Derek miraculously found his way home when Amy got lost. To reward him, we allowed him to eat at the table or to sleep with us. (DST C)

During the 1st century A.D., the art of painting religious murals gradually gained in prominence, with the introduction of Buddhism to China and the consequent building of temples. __17__ For example, paintings of historical characters and stories of everyday life became extremely popular. Besides historical figures, landscape painting was also common in Chinese brush painting. By the 4th century, this particular type of painting had already established itself as an independent form of expression. (DST D)

__28__ It is instant, traveling from point to point. If you don’t print it out,
time is not important. (DST F)

Even though knowledge of cohesion could still be argued to be indispensable for successful performance on the DST, notions constituting coherence, such as thesis statements and topic sentences, appeared to operate simultaneously in the test-taking process. Had other test variables assessing such notions been incorporated as predictors along with the RCT, the resulting R² could have been expected to be higher than .43.

The Predictive Power of Cloze A, Cloze B and Cloze C

Research on vocabulary acquisition has indicated high, positive correlations between depth of vocabulary knowledge and general academic reading comprehension (Qian, 1999, 2002). Among the major components of depth of vocabulary knowledge, collocation is typically referred to as an important dimension (Haastrup & Henriksen, 2000; Qian, 1999, 2002; Read, 1994). On theoretical grounds, it stands to reason that Cloze B (lexical cohesion), embracing collocation as the primary domain, served as the most influential predictor of performance on the academically oriented DST. Moreover, Gutwinski’s (1976) analysis of paragraph-level discourse identified different patterns for grammatical cohesion and lexical cohesion, with an increase in, and greater reliance on, lexical cohesion and, by contrast, a decrease in grammatical cohesion. In the present study, the texts of the DST and the RCT involved paragraph-level reading, which may account for the significance of Cloze B (lexical cohesion) in discourse comprehension and may lend support to Gutwinski’s (1976) observation.

Cohesion is generally categorized as grammatical cohesion and lexical cohesion, with Cloze A (reference) and Cloze C (conjunction) leaning toward the grammatical end and Cloze B the lexical. While Cloze B involves the “open-ended” domain, Cloze A and C tend to include “closed systems” (Halliday & Hasan, 1976, p. 303). This could be evidenced in the beta weights for Cloze A (.23) and Cloze C (.22), which differed only negligibly. In light of school English learning, while the open-ended, depth-oriented systems may still be expanding alongside the growth of learners’ general proficiency, the closed systems may be acquired within a fixed, specific period of time, for example, through cumulative reading experience and exposure to comprehensive input. Such closedness may explain the similarity between Cloze A and Cloze C in predictive power.

Another interesting finding worth discussing is the exclusion of Cloze C from the best-fitting regression model for the Grade 12 data. Halliday and Hasan (1976) remark on the status of conjunction that it is “on the borderline of the grammatical and the lexical” (p. 303). Brown and Yule (1983) suggest that despite “the absence of formal markers,” especially conjunctive markers such as “and,” “but” or “so,” the reader may still construct and arrive at a coherent interpretation of a given text under a specific context (p. 192). Generally speaking, when students reach the 12th-grade level, despite the absence of explicit conjunctive markers, they may still succeed in making inferences about logical relationships between propositions in a given discourse (Hagerup-Neilsen, 1977). Based on language development, it can be argued that learners at this level may develop more sophisticated knowledge of cohesion. Rather than resorting solely to explicit conjunctive signals, they may rely more on knowledge of reference and lexical cohesion, especially the latter, to construct general coherence, inferences, and comprehension of a text. On the other hand, for the 10th and 11th graders, whose knowledge of cohesion may still be fledgling, the three primary components of cohesion may remain indispensable and exert different influences as these graders tackle the tasks.


Summary

This section provided evidence and arguments to explain research findings on the establishment of reliability and validity of the tests, global test inequivalence and the predictive values of the RCT and the subtests regarding performance on the DST. Conclusions, pedagogical implications, limitations of the study and suggestions for future research will be highlighted in the next chapter.