4.3 Overall Results of Lexical Bundles in Chinese
The quantitative measures have provided a candidate list of lexical bundles in Chinese. These potential bundles fulfill all the quantitative criteria, including:
(i) reaching the frequency threshold of occurring at least twenty times per million words,
(ii) reaching the text count threshold of occurring in at least five corpus texts,
(iii) getting a DP value no higher than 0.65, and
(iv) reaching the G threshold at zero.
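The four criteria above can be read as a single filter over candidate sequences. The following is a minimal sketch rather than the study's actual implementation: the `Candidate` record and its field names are hypothetical, the DP and G values in the sample data are made up, and treating the G threshold as `g >= 0` is an assumption about how "reaching the G threshold at zero" is applied.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate word sequence."""
    sequence: str
    freq_per_million: float  # relative frequency per million words
    text_count: int          # number of corpus texts containing the sequence
    dp: float                # dispersion value (lower = more evenly dispersed)
    g: float                 # word association score


def passes_quantitative_criteria(c: Candidate) -> bool:
    """Apply the four quantitative thresholds listed in the text."""
    return (
        c.freq_per_million >= 20   # (i) frequency threshold
        and c.text_count >= 5      # (ii) text count threshold
        and c.dp <= 0.65           # (iii) DP threshold
        and c.g >= 0               # (iv) G threshold (assumed to mean G >= 0)
    )


# Sample data: frequency and text count of "shi bu shi" follow the chapter;
# every other number here is invented for illustration only.
candidates = [
    Candidate("shi bu shi", 1317.0, 93, 0.30, 3.0),
    Candidate("rare sequence", 12.0, 9, 0.40, 2.0),     # fails (i)
    Candidate("local repetition", 45.0, 6, 0.90, 1.5),  # fails (iii)
]

survivors = [c.sequence for c in candidates if passes_quantitative_criteria(c)]
print(survivors)
```

Sequences that survive this filter are still only candidates; the manual exclusion step is not modeled here.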
A manual analysis is still needed to exclude word sequences which are not readily interpretable in functional/pragmatic terms.25 Sequences that remain are identified as lexical bundles. The following table summarizes the whole procedure.
Table 4.2. Numbers of word sequences passing each threshold.
(Each row gives the number of sequence types remaining after the step named in the row label.)
                       Three-word               Four-word
                       Spoken      News         Spoken      News
Types of sequences     165,970     3,044,598    156,078     2,793,826
Frequency threshold    1,024       101          143         3
Text count threshold   998         101          141         3
DP threshold           935         100          123         3
G threshold            843         98           118         3
Manual exclusion       643         87           105         3
In line with expectations, while there are a large number of sequence types, only a tiny proportion of them are frequently used. It is striking that very few four-word sequences in news pass the frequency threshold. It is also clear that conversations feature a much wider range of different lexical bundles than newswire texts. However, both in conversation and in news, the number of three-word bundle types is much larger than that of four-word bundle types.
25 See Section 3.4 for the exclusion criteria.
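To make "a tiny proportion" concrete, the shares of sequence types that pass the frequency threshold can be recomputed from the counts in Table 4.2. This is only an arithmetic sketch; the dictionary keys are labels chosen here for illustration.

```python
# Counts copied from Table 4.2: total sequence types and the number of types
# that pass the frequency threshold, per bundle length and subcorpus.
types_total = {
    "three-word spoken": 165_970,
    "three-word news": 3_044_598,
    "four-word spoken": 156_078,
    "four-word news": 2_793_826,
}
passed_frequency = {
    "three-word spoken": 1_024,
    "three-word news": 101,
    "four-word spoken": 143,
    "four-word news": 3,
}

# Share of types that pass the frequency threshold, as a percentage.
shares = {
    name: 100 * passed_frequency[name] / types_total[name]
    for name in types_total
}

for name, share in shares.items():
    print(f"{name}: {share:.4f}% of types pass the frequency threshold")
```

In every column the share is well under one percent, and it is smallest for four-word sequences in news.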
Lexical bundles also cover a larger proportion of the corpus data in conversation than in news. The following table presents the percentages of words in lexical bundles.
Table 4.3. Percentages of words in lexical bundles.
(The percentages in the parentheses are calculated without removing punctuation marks.)
             Spoken             News
Three-word   13.26% (10.68%)    1.17% (0.99%)
Four-word    2.01% (1.62%)      0.03% (0.03%)
Total        15.27% (12.30%)    1.20% (1.02%)
The same tendencies are also observed in English. In spontaneous speech (e.g., face-to-face conversations), speakers face real-time pressure. Therefore, a common strategy is to rely on frequent repetitions of prefabricated chunks such as lexical bundles (Biber et al. 2004, Johnstone 2002, Tannen 1982).
The following figure demonstrates the frequency distributions of lexical bundles.26
26 As shown in Table 4.2, there are only three four-word bundles in news. It is inappropriate to draw a boxplot with only three data points. To make the shapes of the boxes clear, lexical bundles occurring more than 200 times per million words are not included. All of them are three-word bundles in conversation.
Figure 4.4. Frequency distributions of lexical bundles.
(The boxes from left to right are for three-word bundles in conversation, three-word bundles in news, and four-word bundles in conversation. The numbers on the vertical axis are frequencies per million words.)
Some lexical bundles occur with a very high frequency. As shown in the above figure, most of them are three-word bundles in conversation. The most common bundle in each set is as follows:
(i) three-word bundle in conversation: shi bu shi ‘A-not-A yes-no QUESTION’ (1,317 times per million words),
(ii) three-word bundle in news: shi yi ge ‘COPULA one CLASSIFIER’ (181 times),
(iii) four-word bundle in conversation: mei yi ge ren ‘every one CLASSIFIER person; everyone’ (159 times),
(iv) four-word bundle in news: you hen da de ‘have very large DE’ (24 times).
Because some lexical bundles occur much more frequently than others, the frequency distributions are skewed, and the Shapiro-Wilk normality test confirms that they do not follow normal distributions. Thus, the nonparametric Mann-Whitney test is used to compare the groups whose means are given in the following table.
Table 4.4. Means of relative frequencies (per million words) of lexical bundles.
Three-word spoken Three-word news Four-word spoken
55.4 37.9 38.6
Many three-word bundles in conversation occur with a very high frequency. As shown above, the three-word bundle with the highest frequency in conversation occurs approximately seven times more often than that in news. Therefore, it is not surprising that the relative frequency mean of three-word bundles in conversation is the highest.
It is evident that three-word spoken bundles occur more frequently than three-word news bundles (p = 0.001), but the difference between three-word and four-word spoken bundles is not statistically significant (p = 0.06).
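The rank-based logic behind the Mann-Whitney test can be sketched as follows. This is an illustrative sketch with made-up frequencies, not the study's data or code. It computes only the two U statistics (using midranks for ties); in practice a p-value would come from a statistics package, for example SciPy's `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples, using midranks for tied values."""
    # Pool the observations, remembering which sample each one came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)

    rank_sum_x = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Ranks i+1 .. j are shared equally among the tied observations.
        midrank = (i + 1 + j) / 2
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical relative frequencies (per million words) for two bundle sets.
spoken = [25.0, 40.0, 60.0, 130.0]
news = [21.0, 24.0, 33.0]

print(mann_whitney_u(spoken, news))
```

Because the statistic depends only on ranks, a handful of extremely frequent bundles cannot dominate the comparison the way they would dominate a comparison of raw means.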
There are two dispersion measures (i.e., text counts and DP) in the present study.
Before the dispersion of lexical bundles is discussed, the text counts of lexical bundles need to be normalized against the text numbers of the subcorpora (i.e., 113 conversation texts and 13,800 news texts). For example, shi bu shi occurs in 93 conversation texts, so its normalized text count is 0.823 (i.e., 93/113).
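The normalization described above is a simple ratio. A minimal sketch, with a hypothetical function name, reproduces the shi bu shi example:

```python
# Number of texts in each subcorpus, as reported in the chapter.
TEXTS_IN_SUBCORPUS = {"conversation": 113, "news": 13_800}


def normalized_text_count(texts_with_bundle: int, subcorpus: str) -> float:
    """Proportion of texts in the subcorpus that contain the bundle."""
    return texts_with_bundle / TEXTS_IN_SUBCORPUS[subcorpus]


# shi bu shi occurs in 93 of the 113 conversation texts.
shi_bu_shi = normalized_text_count(93, "conversation")
print(round(shi_bu_shi, 3))
```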
Just like frequencies, text counts also have skewed distributions. Some lexical bundles occur in a much larger number of texts than others, as the following figures show.
Figure 4.5. Quantile-quantile plots for text counts of lexical bundles.
(The upper left panel is for three-word spoken bundles, the upper right one is for three-word news bundles, and the lower one is for four-word spoken bundles.)
Given the skewed distributions, the Mann-Whitney test is used to compare the groups whose text count means are given in the following table.
Table 4.5. Means of text counts (in percentages) of lexical bundles.
Three-word spoken Three-word news Four-word spoken
15.9% 1.54% 12.2%
The huge difference between three-word spoken and news bundles achieves statistical significance (p < 2.2e-16), and the difference between three-word and four-word spoken bundles is also statistically significant (p = 0.039). It appears that spoken bundles tend to occur in a larger proportion of texts than news bundles do.
However, DP values show an entirely different tendency. The following table presents the DP means of lexical bundles.
Table 4.6. Means of DP values of lexical bundles.
Three-word spoken Three-word news Four-word spoken
0.40 0.15 0.42
Only the DP values of four-word spoken bundles follow a normal distribution, so the Mann-Whitney test is again used for the comparisons. The DP of three-word news bundles is lower than that of three-word spoken bundles (p < 2.2e-16), but the difference between three-word and four-word spoken bundles is not statistically significant (p = 0.263). Contrary to the finding based on text counts, the DP distributions show that three-word news bundles are more evenly dispersed than three-word spoken bundles.
The reason why text counts and DP values display opposite patterns may be that the former measure is highly sensitive to text length.27 On average, each conversation text contains 4,069 tokens, which is nearly nine times the average token number of a news text (i.e., 6,475,872/13,800 = 469.2).
27 There are 113 texts in the conversation corpus, which contains 459,833 tokens; there are 13,800 texts in the news corpus, which contains 6,475,872 tokens.
Now consider the following toy example, which is quite similar to the situation in the present study:
Figure 4.6(a). Distribution of a lexical bundle a in the subcorpus A.
(The thin bars stand for text boundaries. The thick bars stand for bundle occurrences.)
Figure 4.6(b). Distribution of a lexical bundle b in the subcorpus B.
(The thin bars stand for text boundaries. The thick bars stand for bundle occurrences.)
The texts in the subcorpus A are more than twice as long as those in the subcorpus B.
There are four texts in the subcorpus A, and the bundle a occurs in 75% of the texts.
There are ten texts in the subcorpus B, and the bundle b occurs in merely 30% of the texts. However, if we evenly divide both subcorpora and calculate the DP values for a and b (see Section 3.3), then it is evident that the two bundles will be equally well-dispersed. In the present study, the text length difference is even greater.
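The contrast in the toy example can be reproduced numerically. The sketch below rests on several assumptions: the token positions and text lengths are invented (the figures themselves are not reproduced here), both subcorpora are given 1,000 tokens, and DP is computed in the standard way for equal-sized parts, as half the sum of absolute differences between observed and expected proportions. The study's own definition is the one given in Section 3.3.

```python
def share_of_texts_with_bundle(positions, text_length, n_texts):
    """Proportion of equal-length texts that contain at least one occurrence."""
    texts_hit = {pos // text_length for pos in positions}
    return len(texts_hit) / n_texts


def dp_over_equal_parts(positions, corpus_size, n_parts):
    """DP over n_parts equal slices of the corpus (0 = perfectly even)."""
    part_size = corpus_size / n_parts
    counts = [0] * n_parts
    for pos in positions:
        # Clamp so that a position at the very end falls in the last part.
        counts[min(int(pos // part_size), n_parts - 1)] += 1
    total = len(positions)
    expected = 1 / n_parts  # equal parts, so equal expected proportions
    return 0.5 * sum(abs(c / total - expected) for c in counts)


# Invented token offsets: both bundles occur at the same three positions.
occurrences = [100, 400, 900]
CORPUS_SIZE = 1_000

# Subcorpus A: four texts of 250 tokens; subcorpus B: ten texts of 100 tokens.
coverage_a = share_of_texts_with_bundle(occurrences, text_length=250, n_texts=4)
coverage_b = share_of_texts_with_bundle(occurrences, text_length=100, n_texts=10)

# Dividing each subcorpus into the same five equal parts ignores text borders.
dp_a = dp_over_equal_parts(occurrences, CORPUS_SIZE, n_parts=5)
dp_b = dp_over_equal_parts(occurrences, CORPUS_SIZE, n_parts=5)

print(coverage_a, coverage_b)  # text-based measure differs
print(dp_a == dp_b)            # part-based measure does not
```

With identical occurrence positions, only the placement of text boundaries changes between the two subcorpora, which is exactly what separates the text count measure from DP over equal parts.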
As a consequence, it comes as no surprise that the text count difference between three-word conversation and news bundles is dramatic (i.e., 15.9% vs. 1.54%). Based on DP values, three-word news bundles actually seem to be more evenly dispersed than three-word spoken bundles. The conflicting findings here also suggest that DP is needed to complement text counts in the identification of lexical bundles.
The following figure shows the distributions of the word association measures.
The G means of three-word spoken bundles, three-word news bundles, and four-word spoken bundles are 3.19, 3.76, and 3.50, respectively. The means are all above three, which recalls the observation that word sequences with an MI score higher than three are of greater use for second language learners at beginning and intermediate levels (McEnery et al. 2006: 217). The role of Chinese bundles in second language learning needs to be further explored, but this is beyond the scope of the present study.
Figure 4.7. G distributions of lexical bundles.
(The boxes from left to right are for three-word bundles in conversation, three-word bundles in news, and four-word bundles in conversation.)
The G values of three-word bundles in conversation do not follow a normal distribution, so the Mann-Whitney test is again used to compare the groups. The difference between three-word spoken and news bundles achieves statistical significance (p = 0.002), and that between three-word and four-word spoken bundles is also statistically significant (p = 0.035). That is, the components in news bundles tend to be associated more closely than those in spoken bundles, and the components in longer bundles tend to be associated more closely than those in shorter bundles.
4.4 Summary
In the process of identifying lexical bundles, the present study adds two quantitative thresholds to the Biberian approach. First, DP reflects the dispersion of word sequences more accurately than text counts and weeds out some local repetitions that narrowly pass the text count threshold. Second, the word association measure G filters out many word sequences that contain frequently occurring function words but do not have identifiable functions. These two measures are fairly independent of relative frequencies and text counts and complement the Biberian approach. However, the quantitative measures cannot screen out all semantically/pragmatically vague word sequences, so manual intervention is still needed. In the end, 838 lexical bundles in total (i.e., 643 three-word spoken bundles, 87 three-word news bundles, 105 four-word spoken bundles, and 3 four-word news bundles) are identified for further analysis.
Echoing previous findings in English (e.g., Biber et al. 1999), the present study shows that lexical bundles in different text types display different distributional patterns. Conversations feature a much wider range of different lexical bundles than newswire texts. Lexical bundles also cover a larger proportion of the corpus data in conversation than in news. These patterns suggest that in spontaneous speech, speakers are under real-time pressure and thus rely more heavily on prefabricated expressions such as lexical bundles. Regarding the dispersion of lexical bundles, the DP distributions show that news bundles are more evenly dispersed than spoken bundles.
It is also found that news bundles achieve stronger internal associations than spoken bundles. The G means of spoken and news bundles fall around three, which has been argued to be a critical value showing that these multi-word combinations are important in language acquisition (McEnery et al. 2006). The strong association between the elements of lexical bundles confirms that lexical bundles are not merely accidental combinations of high-frequency words (Conrad and Biber 2004). What knits these high-frequency items into lexical bundles is their essential communicative function in language use. The following two chapters will explore the functions of lexical bundles in Chinese and reveal more genre differences.
Chapter 5
Lexical Bundles in Conversation
In this chapter, we will focus on the form and function of lexical bundles in conversation (Sections 5.1 and 5.2). We will also address some intriguing issues, such as the interaction between structural and functional categories, the interaction between quantitative measures and discourse functions, and similarities and differences between Chinese and English in their use of spoken bundles (Section 5.3).