
CHAPTER II LITERATURE REVIEW

2.2 Methods of Analyzing ESL Learners' Miscollocations

2.2.3 Concordancers and MI Measures

With the promotion of lexical statistics (Church and Hanks 1989) and better-programmed concordancers, measures such as MI (mutual information) have allowed researchers to broaden the scope of inspecting a word's collocates to a span of up to five words (Kilgarriff et al. 2004). Instead of reading concordance after concordance, researchers can now obtain a conveniently summarized list of salient collocates for a keyword.
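For reference, the MI score mentioned here can be written in its standard form; the formula below is a conventional reconstruction of pointwise mutual information, not an equation quoted from the studies under review.

```latex
% Pointwise mutual information between a node word x and a collocate y,
% following the standard formulation associated with Church and Hanks.
% N is the corpus size; f(.) denotes observed frequencies within the chosen span.
\[
  I(x, y) \;=\; \log_2 \frac{P(x, y)}{P(x)\,P(y)}
          \;\approx\; \log_2 \frac{f(x, y)\cdot N}{f(x)\,f(y)}
\]
```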

Lin (2010) conducted an informative study comparing the Verb-Noun miscollocations of Chinese and Taiwanese ESL learners by adopting CLEC (Chinese Learner English Corpus), approximately 3.4 million words, and a Taiwanese ESL learners' corpus of around 1.8 million words. First, she extracted all the Verb-Noun combinations from the two corpora with the software AntConc and MonoConc Pro. Then, comparing these combinations with the BNC by means of a Perl script, Lin identified the combinations that did not overlap with the BNC. Finally, she manually checked all the potentially erroneous Verb-Noun collocations by consulting dictionaries and online resources. Her results showed that 210 types of miscollocations were detected in the Taiwanese ESL learner corpus and 268 in CLEC, and that about 10% of the miscollocations overlapped between the two corpora.

The studies reviewed above in 2.2.1 to 2.2.3, though all offering insightful results and discussions, leave room for improvement in the following respects.

As for the data extraction procedures, except for Lin (2010), the previous studies all required a great deal of manual labor during data extraction and might not be easily reproducible in future research. Even in Lin's (2010) study, although a semi-automated method was adopted (a Perl script was used to filter out the V-N combinations shared by the learner corpora and the BNC, so that the potentially erroneous V-N combinations could be isolated from the learner corpora), most of the procedure was still manual.
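Purely for illustration, a comparable filtering step could be sketched in a few lines of Python; the file names and the tab-separated verb-noun format below are assumptions for the example, not details reported by Lin (2010).

```python
# Minimal sketch of the semi-automated filtering step described above:
# keep only learner Verb-Noun pairs that never occur in the reference corpus.
# File names and the "verb<TAB>noun" format are illustrative assumptions.

def load_vn_pairs(path):
    """Read one 'verb<TAB>noun' pair per line into a set of (verb, noun) tuples."""
    pairs = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            verb, noun = line.rstrip("\n").split("\t")
            pairs.add((verb.lower(), noun.lower()))
    return pairs

learner_pairs = load_vn_pairs("learner_vn_pairs.tsv")   # e.g. from AntConc / MonoConc output
bnc_pairs = load_vn_pairs("bnc_vn_pairs.tsv")           # reference corpus pairs

# Candidate miscollocations: attested in the learner corpus, unattested in the BNC.
candidates = sorted(learner_pairs - bnc_pairs)
for verb, noun in candidates:
    print(f"{verb} {noun}")
```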

Also, as noted before, the potential miscollocations in the previous research were double-checked manually against many kinds of resources, such as The BBI Dictionary of English Word Combinations, the Oxford Collocations Dictionary, the Oxford Advanced Learner's Dictionary, etc. Nevertheless, the native examples from a well-organized corpus like the BNC were left without further consultation. This could result from constraints of labor and time: if the V-N collocations in the BNC had to be extracted manually for comparison, it would take too much time and too many human resources.

Third, due to practical constraints, most of the analyses and decisions in the past research were made by the researchers alone, with reference to hard-copy and online dictionaries. Although the results were later examined rigorously, this still raises concerns about human judgment and fatigue.

Finally, with the assistance of technology, the sizes of corpora around the world are increasing day by day. Once new sources of data are incorporated into current corpora, new results or different analyses can be expected.

CHAPTER III

METHOD

This section presents the tools, data sources, and the planned extraction and analysis procedures for this study. First, the online platform The Sketch Engine is introduced, together with its two major functions for the target research. Then, the adopted corpora, both general corpora and ESL learner corpora, are described. Finally, the semi-automated method and the final judgment process are explained.

3.1 Instruments-The Sketch Engine

First utilized in the compilation of the Macmillan English Dictionary (Rundell 2002) and debuted at Euralex 2002 (Kilgarriff and Rundell 2002), a word sketch is a one-page, automatically generated, corpus-based summary of a word's grammatical and collocational behaviour (Kilgarriff et al. 2004: p. 1).

The Sketch Engine (also known as the Word Sketch Engine; henceforth SKE) is an innovative corpus query system that provides word sketches, grammatical relations, and a distributional thesaurus (Huang and Hong 2006). With its clear and constantly updated online platform, SKE has been gaining more and more attention in recent years.

In response to the ever-changing era of technological advancement, SKE was designed to cope with the ensuing challenges and to offer distinctive functions.

First of all, with the introduction of the Gigaword corpus (a 1,000-million-word corpus) by the Linguistic Data Consortium (http://www.ldc.upenn.edu/), researchers around the world sensed that the traditional concordancer interface could no longer handle such an enormous amount of data (Kilgarriff and Grefenstette 2003). Instead of just reading lines of co-occurrences, more systematic arrangements were clearly needed. As a result, Word Sketch, which examines a word in its various grammatical contexts, was designed on top of Manatee, a state-of-the-art CQS (Corpus Query System), and is now able to display a set of up to 27 grammatical relations connected to a headword (Kilgarriff et al. 2004) (Figure 3.1).
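To make the contrast with plain concordance reading concrete, the following minimal sketch groups dependency-style triples into a word-sketch-like summary for one headword; the triples and their format are invented for illustration and are not Manatee's internal representation.

```python
# Illustrative sketch: turning (relation, headword, collocate) triples into a
# word-sketch-like summary, i.e. collocates grouped by grammatical relation.
# The triples themselves are invented examples, not Manatee's internal format.
from collections import defaultdict, Counter

triples = [
    ("object_of", "knowledge", "acquire"),
    ("object_of", "knowledge", "gain"),
    ("object_of", "knowledge", "acquire"),
    ("modifier", "knowledge", "scientific"),
    ("modifier", "knowledge", "prior"),
]

sketch = defaultdict(Counter)
for relation, headword, collocate in triples:
    if headword == "knowledge":
        sketch[relation][collocate] += 1

for relation, collocates in sketch.items():
    ranked = ", ".join(f"{w} ({n})" for w, n in collocates.most_common())
    print(f"{relation}: {ranked}")
```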

Currently, The Sketch Engine, with its carefully designed tagging functionality, includes word sketches, thesaurus search, sketch differences, and many other practical tools in its repertoire of services. When keying in a word, users discover that not only are concordance lines available but also that all the parts of speech of the target word are delineated, much like its DNA profile. Without the traditional repetition of searching for particular collocation types of a headword time after time, this convenient one-query-for-all-results interface undoubtedly saves a great deal of time and effort.

Figure 3.1 An Example of the Word Sketch Function on the SKE Website

Now serving as a commercial product, the SKE website provides researchers, teachers, and students with a platform for learning and academic study. As Kilgarriff et al. (2004) reported, a multi-word searching function was being tested; in the near future, a "multi-word sketch" may be announced as a breakthrough. The following are introductions to the two main SKE functions needed for this study-Corpus Creating and Sketch Diff.

3.1.1 Corpus Creating Function

Generally, there are three basic functions on the SKE-access to large corpora, ranging from 30 million to 10 billion words in up to 42 languages; a WebBootCaT category, which allows members to build an instant corpus of their own by automatically retrieving web texts on the basis of seed keywords; and a Corpus Creating function for compiling personal corpus data.

As its name denotes, the Corpus Creating function allows users to upload their own data onto the platform for further analyses with the tools on the SKE website. Once a corpus is set up on the SKE, several kinds of operations can be executed, such as corpus querying, wordlist compiling, word sketches, the thesaurus, and Sketch Diff, the central one for this study, which is elaborated in detail in 3.1.2. The target Chinese ESL learner corpora for this study were uploaded onto the SKE so that a semi-automated comparison could be carried out with its technical assistance.

3.1.2 Sketch Diff Function

The Sketch Diff function, with Diff standing for difference, was developed by Kilgarriff et al. (2004) to display the collocational discrepancies between two synonyms (Figure 3.2). When learners come across two seemingly similar words like intelligent and clever, they inevitably wonder how to use them correctly in real situations. As observed in Figure 3.2, certain distinctive adjectives accompany either intelligent or clever in a straightforwardly different manner, a kind of insight that traditional thesaurus dictionaries or even online resources could not offer. For example, sensitive, bright, and charming frequently co-occur with intelligent, while cunning, brave, and bloody tend to combine with clever.

Figure 3.2 The Sketch Diff Interface Showing intelligent and clever in the BNC

For Verb-Noun collocations, if speak and tell in the BNC are taken as examples, the Sketch Diff option clearly displays that, among words accompanying speak and tell as Objects, story shows an overwhelming frequency of 1,309 occurrences with tell but none with speak. On the other hand, English appears 414 times with speak but never with tell (Figure 3.3). In other words, tell a story/lie/tale are strongly related pairs, while speak English/words/languages are closely connected ones.

The four numbers next to each collocate indicate, respectively, its frequency and salience score with the first and the second keyword (in this case speak and tell). A parallel contrast can therefore be observed quickly and systematically.
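The exact salience formula has varied across versions of the SKE; purely as background, the logDice score reported by later versions is commonly defined as follows (a general formula, not a value read off the interface shown here).

```latex
% logDice salience for a word pair (x, y): f(x) and f(y) are the individual
% frequencies and f(x, y) the co-occurrence frequency; the score is bounded above by 14.
\[
  \mathrm{logDice}(x, y) \;=\; 14 + \log_2 \frac{2\,f(x, y)}{f(x) + f(y)}
\]
```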

Figure 3.3 Collocates in the Object position with speak and tell in the BNC

The use of Sketch Diff in this study, however, is not to compare two words within the same corpus. Instead, with the unique functionality of Word Sketch and Sketch Diff on the SKE, the author plans to retrieve the collocates of a keyword from two different corpora, i.e., a native-speaker English corpus and an ESL learner corpus, and to compare them with each other, all of which is executed automatically by Sketch Diff. In this way, human misjudgment and time-consumption issues can be reduced, and a more comprehensive overview as well as a more systematic comparison between native and non-native English speakers' collocation use can be established. The author's purpose in utilizing the Sketch Diff functionality in this alternative manner is elaborated in section 3.3, Data Extraction.

3.2 Corpora

3.2.1 The British National Corpus

The native reference corpus adopted in this study is the BNC (British National Corpus). Boasting more than 100 million tokens, the BNC is a comprehensively balanced corpus consisting of both written (90%) and spoken (10%) input, with a wide variety of sources ranging from newspapers and university essays to business meetings and informal interviews.

There are four main features of the BNC. First, it is mostly comprised of modern British English. Second, instead of documenting the English language chronologically, the BNC samples linguistic records only from the late twentieth century. Third, covering various styles and subject matters, the BNC is not restricted to any particular domain but is rather a comprehensive database. Fourth, to avoid collecting texts of repeated idiosyncratic styles, the BNC ensures that its sampling is as multifarious as possible, setting maximums for different lengths and types of sources, such as single or multiple authors and shorter or longer texts.

3.2.2 CLEC, SWECCL, JCEE, The Taiwanese Learner Corpus

The Chinese ESL learner corpora, on the other hand, are composed of four major parts-CLEC (Chinese Learner English Corpus, 1.0), SWECCL (Spoken and Written English Corpus of Chinese Learners, 1.0 and 2.0), the JCEE (Joint College Entrance Examinations) Testees Corpus, and The Taiwanese Learner Corpus. The total is 7.3 million words.

The first two corpora are based on the input of Chinese ESL learners in Mainland China. CLEC (Chinese Learner English Corpus) is a large-scale ESL learner corpus compiled by professors Gui and Yang. Comprised of about 1 million words produced by high school and college students, it is frequently adopted for research purposes because of its tagging of 61 error types, including up to 1,288 tokens of Verb-Noun miscollocations ready for analysis (Zhou 2005; Li 2005). As for SWECCL (Spoken and Written English Corpus of Chinese Learners), it is a project led by Wen et al. and is so far the largest Chinese ESL learner corpus in Mainland China. With a size of 3.5 million words, SWECCL contains both written and spoken data. Since the BNC, the author's native reference corpus for this study, is mainly comprised of written input, only the written data from SWECCL, a total of around 2.4 million words, are adopted for further semi-automated extraction and comparison.

The other two learner corpora are made up of English produced by Taiwanese ESL learners. The JCEE (Joint College Entrance Examinations) Testees Corpus, with an approximate total of 2 million words, consists of English written data produced by Taiwanese high school graduates on their college entrance exams. Compiled by the College Entrance Examination Center in Taiwan, the JCEE corpus is currently for research purposes only. The Taiwanese Learner Corpus consists of about 1.8 million words contributed by students from National Taiwan Normal University, National Tsing Hua University, National Taiwan Ocean University, National Taiwan University, National Taichung University, and Soochow University. The students composed essays online about various topics such as technology, politics, education, and school life, with each essay running three hundred to five hundred words.

After the native and non-native corpora were obtained, they were uploaded onto the SKE. The four Chinese ESL learner corpora were first merged into one large corpus, and then the BNC and the combined Chinese ESL learner corpus were analyzed with the functions on the SKE.
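A minimal sketch of this merging step, assuming each learner corpus is available as a single plain-text file (the file names are placeholders), might look as follows.

```python
# Minimal sketch of merging the four learner corpora into a single plain-text
# file before uploading it to the SKE.  The file names are placeholders.
learner_files = [
    "clec.txt",
    "sweccl_written.txt",
    "jcee.txt",
    "taiwanese_learner.txt",
]

with open("chinese_esl_learner_corpus.txt", "w", encoding="utf-8") as merged:
    for path in learner_files:
        with open(path, encoding="utf-8") as part:
            merged.write(part.read())
            merged.write("\n")  # keep a boundary between sub-corpora
```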

3.3 Data Extraction

One breakthrough of this study is the alternative use of the Sketch Diff functionality on the SKE to accomplish both data extraction and analysis in a semi-automated fashion. In past research, the discussion of criteria for which Verb-Noun structures to include and which to filter out often took much of the researchers' time and effort. By applying the tagging system and powerful sorting tools on the SKE, the author, starting from a list of frequent nouns generated from the Chinese ESL learner corpus, extracts the target Verb collocates from both the BNC and the Chinese ESL learner corpus and compares them with the Sketch Diff function.

First, if knowledge is selected as an example to compare between the native and non-native corpora, the Sketch Diff provides a summary chart of the corresponding collocates of knowledge in distinct part-of-speech positions (Figure 3.4). Then, since our focus is on Verb-Noun miscollocations, the left column with the heading object_of (that is, knowledge used as an Object) is examined. The red area marks the Verb collocates Chinese ESL learners tend to use with knowledge but native speakers never do, such as enrich (99 times), study (94 times), and master (58 times). The green part marks the Verb collocates native speakers habitually use with knowledge but non-natives do not. Those extreme examples that native speakers never produce are our target for further inspection.
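The decision encoded by the colour coding can be restated as a small filter: given the verb collocates of a noun in the two corpora, keep the verbs that learners use but native speakers never do. In the sketch below, the counts for enrich, study, and master echo the figures quoted above, while the remaining numbers are toy data.

```python
# Illustrative filter corresponding to the "red area": verb collocates of a
# noun that occur in the learner corpus but never in the native corpus.
# enrich/study/master counts follow the text above; other figures are toy data.
learner_verbs = {"enrich": 99, "study": 94, "master": 58, "acquire": 12}
native_verbs = {"acquire": 1530, "gain": 880, "impart": 95}

red_area = {
    verb: freq
    for verb, freq in learner_verbs.items()
    if native_verbs.get(verb, 0) == 0            # never used by native speakers
}
print(red_area)   # {'enrich': 99, 'study': 94, 'master': 58}
```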

Figure 3.4 Knowledge Compared between Native & Non-Native Corpora

Next, to probe into what the concordances are and how they are misused, the entry enrich is chosen, and a list of concordance lines is shown (Figure 3.5). In this way, the author can examine the contexts quickly to decide whether the evidence given by the Sketch Diff between natives and non-natives reflects actual miscollocations or not. The references adopted for this step are introduced in section 3.4, Data Analysis.

Figure 3.5 Concordances of enrich_knowledge in the Chinese ESL Learner Corpus

As for the keywords tested with the Sketch Diff interface, they were based on a list of the most frequently used nouns in the Chinese ESL learner corpus, generated online by the SKE. According to Liu (2002), nouns tend to be the crucial indicators of learners' English Verb-Noun miscollocations: it is more efficient to capture V-N misuse by inspecting the verb collocates of a noun than by looking into the noun collocates of a verb. A similar idea is proposed by Manning and Schütze (1999) with the term "focal word", indicating the central role of nouns in V-N collocations.

In this study, the frequency threshold was set at 300. That is, only nouns with at least 300 tokens in the learner corpus were incorporated in this study. This requirement eventually narrowed the number down to 690 key nouns to be compared (cf. Appendix A). In addition, only those V-N miscollocations found at least three times in the Chinese ESL learner corpus were considered significant enough by the author for further comparison and discussion with the native speaker corpora.
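A minimal sketch of how these two thresholds could be applied, assuming a noun-frequency list has been exported from the SKE as tab-separated text (the file name, format, and helper function are assumptions), is given below.

```python
# Illustrative application of the two thresholds described above:
# (1) keep only nouns with at least 300 tokens in the learner corpus, and
# (2) keep only V-N candidates attested at least three times by learners
#     while being unattested in the BNC.
# The file name, its "noun<TAB>frequency" format, and the helper function
# are assumptions for illustration, not part of the SKE interface.
NOUN_THRESHOLD = 300
PAIR_THRESHOLD = 3

key_nouns = set()
with open("learner_noun_frequencies.tsv", encoding="utf-8") as handle:
    for line in handle:
        noun, freq = line.rstrip("\n").split("\t")
        if int(freq) >= NOUN_THRESHOLD:
            key_nouns.add(noun)

def is_candidate_miscollocation(verb, noun, learner_freq, bnc_freq):
    """Flag a V-N pair for manual inspection under the two thresholds."""
    return noun in key_nouns and learner_freq >= PAIR_THRESHOLD and bnc_freq == 0
```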

In a semi-automated manner, the most frequently used nouns in the Chinese ESL learner corpus and their common verb collocates in the ESL learner corpus and in the BNC were checked one by one with the Sketch Diff function. As demonstrated by the two colored areas (cf. Figure 3.4), certain verbs are used significantly more often by either natives or non-natives. This, ultimately, is the target function on the SKE platform that the author applies in this study, i.e., manipulating the Sketch Diff interface to examine common Verb-Noun collocations in native and non-native corpora in a semi-automated manner.

3.4 Data Analysis

The analysis of the results provided by the Sketch Diff described in section 3.3 proceeds in the following steps.

First, the suspicious V-N collocations detected by the Sketch Diff function that native speakers never used (0 tokens found in the BNC) were targeted. Of these, only those found at least three times in the Chinese ESL learner corpus were considered significant enough by the author for further comparison and discussion with the native speaker corpora.

Second, during the examination process, based on the red area (Figure 3.4), which indicates the V-N combinations found at least three times in the Chinese ESL corpus but not at all in the BNC, the author double-checked the suspicious examples in the Corpus of Contemporary American English (COCA), another powerful online corpus, for further confirmation. Since the BNC is basically composed of British English, a parallel check of the suspicious V-N collocations in the COCA, which mostly consists of American English, helps avoid possible oversights. Once a suspicious V-N collocation had been double-checked in the COCA and no entry was found, the author regarded it as a confirmed V-N miscollocation.
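This confirmation step can be summarized as a simple decision rule; the sketch below merely restates that rule, with the BNC and COCA hit counts assumed to be looked up manually and supplied by the analyst.

```python
# Restatement of the confirmation rule as a simple predicate: a suspicious
# V-N pair counts as a miscollocation only if it is reasonably frequent in
# the learner corpus and unattested in both native reference corpora.
# The BNC and COCA counts are assumed to be looked up (manually, on the
# respective websites) and entered by the analyst.
def confirmed_miscollocation(learner_freq, bnc_freq, coca_freq, min_learner_freq=3):
    return learner_freq >= min_learner_freq and bnc_freq == 0 and coca_freq == 0

# Example: a pair with 99 learner tokens and no hits in either native corpus.
print(confirmed_miscollocation(99, 0, 0))   # True
```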

Third, because the Sketch Diff function treats Verb-Noun combinations and Prep-Noun combinations as two separate categories on the SKE platform, the author did not additionally extract possible Verb-Prep-Noun collocations from the Prep-Noun category for this study. All of the results displayed in Chapter IV were originally classified in the Verb-Noun category by the SKE. Even though some Verb-Prep-Noun examples are discussed, they were included because the Sketch Diff function highlighted them as suspiciously erroneous V-N collocations (not found in the BNC). On closer inspection, the verbs in these examples turned out to be acceptable, but the prepositions after the verbs were deviant. The author therefore still considered them part of the results, for general consistency and because of their original categorization as Verb-Noun by the SKE online system.

Fourth, in terms of error classification, the possible types are partially based on Chang and Yang (2009). As reviewed in section 2.1.3, Error Types of ESL Learners' Collocations, there are generally 12 kinds of Verb-Noun miscollocations (cf. Table 3.1).

Table 3.1 Verb-Noun Types of Chang and Yang (2009)

     Error Types                                       Examples
1    Erroneous verb choice                             *learn knowledge
2    Misuse of delexical verbs                         *do recommendations
3    Erroneous use of idioms                           *get touch with them
4    Erroneous noun choice                             *tell a speech
5    Erroneous preposition after verb                  *reply letters
6    Erroneous preposition after noun                  *give sympathy to animals
7    Erroneous use of determiner                       *play piano
8    Erroneous syntactic structure                     *rang the phone
9    Erroneous choice for intended meaning             *break my armed self
10   Redundant repetition                              *work one job
11   Erroneous combination of two collocations         *enjoy yourself a good time
12   Miscellaneous miscollocations which cannot be categorized

Fifth, if the author cannot determine to which category a V-N error belongs, a native speaker of English as well as other resources would be consulted, such as Just the Word (http://www.just-the-word.com/) and dictionaries like The BBI Dictionary of English Word Combinations, the Oxford Collocations Dictionary, the Oxford Advanced Learner's Dictionary, and the Collins COBUILD English Dictionary.

Finally, after a basic error categorization is compiled, the author would look into
