
4.1 Dispersion Measures for Lexical Bundles

In Biber et al. (1999) and many follow-up studies (e.g., Cortes 2002, Biber et al. 2004, Kim 2009), the dispersion threshold requires a sequence to occur in at least five different corpus texts, which guards against individual idiosyncrasies and local repetitions. However, the empirical data in the present study expose some problems with adopting text counts as the only dispersion measure in the identification of lexical bundles.

First, text counts and relative frequencies are highly correlated with each other, as illustrated in the following figures.²³

23 In the news subcorpus, only three four-word sequences pass the frequency threshold. They are not included in the following discussions.

Figure 4.1. Correlations between text counts and relative frequencies.

(The upper left panel is for three-word spoken sequences, the upper right one is for three-word news sequences, and the lower one is for four-word spoken sequences.)

The correlation coefficients for the three sets of word sequences are 0.80, 0.98, and 0.82, respectively. As can be seen from the above figures, almost all the word sequences that pass the frequency threshold also pass the text count threshold. The text count threshold screens out only 26 (out of 1,024) three-word spoken sequences and 2 (out of 143) four-word spoken sequences, and no word sequences in the news subcorpus are excluded here. This suggests that the text count threshold may be of little practical use, a point also noted by Conrad and Biber (2004).
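As a rough illustration of how such a coefficient can be obtained, the sketch below computes Pearson's r between text counts and relative frequencies. It assumes Pearson's r, since the text does not say which coefficient was used, and the data points are hypothetical values chosen only for illustration, not figures from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (text count, relative frequency) pairs, for illustration only.
text_counts = [5, 12, 30, 48, 75]
rel_freqs = [21.0, 35.5, 80.2, 110.0, 190.4]
r = pearson_r(text_counts, rel_freqs)
print(f"r = {r:.2f}")
```

A coefficient close to 1 for real data would mean that the text count adds little information beyond what the frequency threshold already captures.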

Second, among sequences that pass the text count threshold, some are local repetitions that simply reflect the immediate topic of the discourse and are functionally/pragmatically uninteresting. Examples include women de haizi ‘we POSSESSIVE.MARKER child; our children’ and junshi fayanren shi ‘military spokesman office’. Sequences like these do occur in several corpus texts, so they pass the text count threshold; however, they are absent from most of the corpus texts (see also Partington and Morley 2004). A more sensitive dispersion measure is needed to filter them out.

In view of the above problems with text counts, DP appears to be a more reliable dispersion measure. First, as shown in the following figures, the correlation between DP and relative frequencies is not as strong as that between text counts and relative frequencies.

Figure 4.2. Correlations between DP and relative frequencies.

(The upper left panel is for three-word spoken sequences, the upper right one is for three-word news sequences, and the lower one is for four-word spoken sequences.)

Since the correlation coefficients (i.e., -0.36, -0.19, and -0.28, respectively) are much weaker, DP can be treated as largely independent of relative frequencies. Second, DP is more sensitive and can filter out word sequences that pass the text count threshold but have a skewed distribution in the news subcorpus. For instance, although junshi fayanren shi ‘military spokesman office’ occurs in 50 newswire texts, its DP value is rather high (i.e., 0.899). We can set a reasonable DP threshold to exclude such word sequences from further analysis.
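To make the measure concrete, the following is a minimal sketch of DP (deviation of proportions) as commonly defined following Gries: half the sum, over corpus parts, of the absolute difference between the share of a sequence's occurrences found in a part and that part's share of the corpus. The function name and the toy counts are hypothetical, and the sketch computes the unnormalized value; whether the present study uses a normalized variant is not stated in this section.

```python
def deviation_of_proportions(occurrences, part_sizes):
    """DP = 0.5 * sum(|observed share - expected share|) over corpus parts.

    occurrences: occurrences of the sequence in each corpus part (text)
    part_sizes:  size of each corpus part, e.g. in words
    """
    if len(occurrences) != len(part_sizes):
        raise ValueError("occurrences and part_sizes must have the same length")
    total_occ = sum(occurrences)
    total_size = sum(part_sizes)
    if total_occ == 0 or total_size == 0:
        raise ValueError("need at least one occurrence and a non-empty corpus")
    return 0.5 * sum(
        abs(occ / total_occ - size / total_size)
        for occ, size in zip(occurrences, part_sizes)
    )


# Hypothetical corpus of four equally sized texts.
sizes = [1000, 1000, 1000, 1000]
even = deviation_of_proportions([5, 5, 5, 5], sizes)      # spread evenly
clumped = deviation_of_proportions([20, 0, 0, 0], sizes)  # all in one text
print(even, clumped)
```

Values near 0 indicate an even spread across texts, and values near 1 indicate concentration in a small part of the corpus. This is how a sequence can occur in a fair number of texts and still receive a high DP value when those texts make up only a small share of the whole subcorpus.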

To set a reasonable DP threshold, we manually check whether word sequences that pass the text count threshold are of functional/pragmatic value. Take three-word spoken sequences as an example. There are 998 three-word spoken sequences that pass the text count threshold. Five of them have a DP value between 0.80 and 0.89, and all five either are of little functional/pragmatic value (e.g., ban wo chengzhang ‘accompany me grow.up’ is simply a TV or radio program title) or have a very low text count (e.g., zhen de a ‘really’ occurs in exactly five corpus texts). A further 17 word sequences have a DP value between 0.70 and 0.79, and only two of them can be regarded as functionally/pragmatically significant (i.e., man hao de ‘very good’ and zuo bu dao ‘cannot do it’). The same procedure is applied to the remaining word sequences, and the results are presented in the following table.

Table 4.1. Numbers of word sequences at each DP value.

(The first row presents the numbers of word sequences that pass the text count threshold. The numbers in parentheses indicate how many of the word sequences in each DP range can be regarded as functionally/pragmatically significant.)

DP value               three-word spoken   three-word news   four-word spoken
text count threshold   998                 101               141
0.90-0.99              0                   0                 0
0.80-0.89              5 (0)               1 (0)             1 (0)
0.70-0.79              17 (2)              0 (0)             4 (2)
0.65-0.69              41 (5)              0 (0)             13 (1)
0.60-0.64              43 (26)             0 (0)             8 (6)

As can be seen from the above table, almost all the word sequences with a DP value higher than 0.65 are functionally/pragmatically uninteresting. As DP values decrease, more word sequences worth our attention emerge. Therefore, word sequences with a DP value higher than 0.65 are excluded from further analysis. Though a few potential bundles are filtered out as a consequence, many word sequences that seem to be just local repetitions can be efficiently eliminated without manual intervention. The DP threshold suggested here echoes Gries’ (2008b) observation that a lexical item with a DP value between 0.4 and 0.8 (e.g., definition: 0.795; formal: 0.708; properly: 0.625; house: 0.453) is certainly known to all native speakers and advanced learners. The same threshold can also be tried in future studies on lexical bundles.
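Applied mechanically, the threshold amounts to a simple filter, sketched below. The function name and the second candidate are hypothetical. The comparison follows the wording "higher than 0.65" literally; since the 0.65-0.69 band of Table 4.1 is treated as excluded, a value of exactly 0.65 may in practice need to be excluded as well, which this sketch does not assume.

```python
DP_THRESHOLD = 0.65  # threshold value stated in the text


def passes_dp_threshold(dp_value, threshold=DP_THRESHOLD):
    """Keep a sequence unless its DP value is higher than the threshold."""
    return dp_value <= threshold


# 0.899 is the DP value quoted in the text for junshi fayanren shi;
# the second entry is a hypothetical sequence added for contrast.
candidates = {"junshi fayanren shi": 0.899, "hypothetical sequence": 0.50}
kept = [seq for seq, dp in candidates.items() if passes_dp_threshold(dp)]
print(kept)
```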

Although previous studies usually adopt text counts as the only dispersion measure, the lexical bundles they identified and the related findings are still considered solid. As can be seen from the above table, the number of word sequences filtered out by the DP threshold is actually not high: approximately 6% of the three-word spoken sequences, only one three-word news sequence, and approximately 13% of the four-word spoken sequences. In addition, a few word sequences pass the DP threshold but fail the text count threshold. Therefore, the text count threshold is not abandoned in the present study, and the DP threshold is treated as complementary to it.
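The proportions quoted above can be re-derived from Table 4.1 by summing the bands from 0.65 upward and dividing by the number of sequences that pass the text count threshold. The sketch below does this arithmetic. The column labels are an assumption: they follow the panel order given in the figure captions (three-word spoken, three-word news, four-word spoken), which the table itself does not state.

```python
# Counts copied from Table 4.1. Column labels are assumed from the
# panel order in the figure captions.
passed_text_count = {
    "three-word spoken": 998,
    "three-word news": 101,
    "four-word spoken": 141,
}
# Sequences in the bands 0.90-0.99, 0.80-0.89, 0.70-0.79, and 0.65-0.69.
excluded_by_band = {
    "three-word spoken": [0, 5, 17, 41],
    "three-word news": [0, 1, 0, 0],
    "four-word spoken": [0, 1, 4, 13],
}


def excluded_share(label):
    """Return (number removed by the DP threshold, share of the column total)."""
    removed = sum(excluded_by_band[label])
    return removed, removed / passed_text_count[label]


for label in passed_text_count:
    removed, share = excluded_share(label)
    print(f"{label}: {removed} removed ({share:.1%})")
# three-word spoken: 63 removed (6.3%)
# three-word news: 1 removed (1.0%)
# four-word spoken: 18 removed (12.8%)
```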

Though the DP threshold in the present study is still arguably arbitrary, our decision is based on a careful analysis of the data. However, even with the help of a more reliable dispersion threshold, some word sequences that do not serve important functions remain in the data. To screen them out, we may resort to quantitative measures that evaluate the internal association between the elements of a word sequence.
