Manual tagging is required for all languages included in this project, since a fully automatic error-tagging system is hard to build. The most difficult part of manual tagging is the lack of consensus among raters, who may classify similar errors differently. Granger (2003) pointed out that elaborate tagging guidelines, with detailed principles for handling each error category, should be used. In general, two taxonomies for error coding have been commonly agreed upon in previous work: a linguistic category classification and a target modification taxonomy (Tono, 2003). The former refers to linguistic features such as lexis and tense, and the latter refers to the ways in which a form differs from the one used by native speakers, such as omission and change of order (Díaz-Negrillo & Fernández-Domínguez, 2006). In the following section, we show several applications of the NCCU Learner Corpus.
5. Applications of the NCCU Foreign Language Learner Corpus
Learner corpora have been used in many kinds of linguistic analysis, including the lexical analysis of words, collocations and colligations, as well as the analysis of syntactic structures. They provide an authentic resource for observing and analyzing learners’ language, which might have implications for second language (L2) acquisition.
Contrastive interlanguage analysis (CIA) compares the language used by native speakers with that produced by language learners (Granger, 1996). As Granger noted, the corpus collected in the ICLE project is devoted to CIA, in which two types of comparison are usually made: the comparison of the native language (which serves as the reference corpus) with the interlanguage (or non-native varieties), and the comparison between (or among) interlanguages. Because the two learner corpora differ in type (the ICLE is type A, while the NCCU Learner Corpus is type B; cf. Figure 1 above), comparisons can also be conducted cross-linguistically to investigate how learners perform in different languages.
The data in learner corpora are often contrasted with those of native speaker corpora by centering on a linguistic feature and examining whether that feature is used more frequently (overuse) or less frequently (underuse) than in the native speaker corpora. For example, Liu and Shaw (2001) evaluated EFL learners’ knowledge of the verb make, which appears at a high frequency and has various meanings, by comparing the results of a learner corpus and a native speaker corpus. They examined learners’ qualitative knowledge of vocabulary instead of gauging the quantity of words learners know. The results showed that learners’ knowledge of a word is different from that of native speakers. In another study, Chen (2006) analyzed a self-collected corpus of papers written by Taiwanese MA TESOL students, using ten journal articles from two TESOL journals as the reference corpus. She explored the learners’ use of conjunctive adverbials and found that the connectors were overused and sometimes misused at the word level. Palacios-Martínez and Martínez-Insua (2006) examined Spanish learners’ use of the existential there by analyzing two learner corpora in comparison to two native speaker corpora. They found that the uses of there differ in frequency, structural complexity, polarity and pragmatic value. In their evaluation of learners’ EAP writing, Gilquin, Granger, and Paquot (2007) compared learner corpus data with native speaker data and found a number of problems that learners might encounter in academic writing. Learner corpora, moreover, can be used for materials design and for corpus-informed learning tools.
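The overuse/underuse comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in any of the studies cited: the function names, the crude tokenizer, and the two sample texts are all invented for the example, and a real study would work with full corpora and would normally add a significance test rather than compare normalized frequencies alone.

```python
import re
from collections import Counter

# Invented toy texts for illustration only; they are NOT drawn from any corpus.
LEARNER_SAMPLE = "But I like it, but it is hard."
NATIVE_SAMPLE = "I like it and it is hard but fun."


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(tokens, word, per=1000):
    """Return occurrences of `word` per `per` tokens; 0.0 for an empty token list."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * per / len(tokens)


def compare_usage(learner_text, native_text, word, per=1000):
    """Compare the normalized frequency of `word` in learner and native text.

    Returns (learner_rate, native_rate, label), where label is
    "overuse", "underuse", or "equal" from the learner's point of view.
    """
    learner = relative_frequency(tokenize(learner_text), word, per)
    native = relative_frequency(tokenize(native_text), word, per)
    if learner > native:
        label = "overuse"
    elif learner < native:
        label = "underuse"
    else:
        label = "equal"
    return learner, native, label


if __name__ == "__main__":
    for item in ("but", "and"):
        print(item, compare_usage(LEARNER_SAMPLE, NATIVE_SAMPLE, item))
```

Normalizing by corpus size matters because learner and native speaker corpora rarely contain the same number of tokens, so raw counts alone cannot be compared directly.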
In addition to research on written learner corpora, the study of spoken corpora also helps researchers identify problems and features of learner language. Shirato and Stapleton (2007), for example, examined a spoken learner corpus of Japanese learners of English, showing that the learner corpus is a useful tool for revealing how learner language differs from that of native speakers.
Even when a learner corpus contains only raw data, researchers can investigate and contrast the raw frequencies of words or collocations. Although this approach is less sophisticated, Granger (1996: 45) still described it as a “very fruitful undertaking”, with regard to Granger, Meunier, and Tyson’s (1994) research on the learner lexicon, which revealed learners’ overuse of but and underuse of and. The application of concordancing further provides evidence of how learners use a word in context and how that use differs from the usage of native speakers.
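Concordancing of the kind mentioned above can be illustrated with a small keyword-in-context (KWIC) sketch in Python. The function name, the `width` parameter, and the sample sentence are assumptions made for this example; they do not describe the search functions of the NCCU Learner Corpus interface or of any existing concordancer.

```python
import re


def concordance(text, keyword, width=3):
    """Return keyword-in-context entries for every occurrence of `keyword`.

    Each entry is a (left_context, keyword, right_context) tuple containing
    up to `width` tokens on either side of the hit. Matching is
    case-insensitive and uses a simple word-character tokenizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    entries = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            entries.append((left, token, right))
    return entries


if __name__ == "__main__":
    # Invented sentence for illustration only; not taken from the corpus.
    sample = "I go to school and listen to music"
    for left, hit, right in concordance(sample, "to", width=2):
        print(f"{left:>15} [{hit}] {right}")
```

Lining up the hits in this way lets a researcher read the immediate context of each occurrence, and the same function can be run on learner and native speaker texts to compare usage side by side.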
When the corpus is parsed and tagged, research focusing on word categories and syntactic structures becomes possible. Tagging the errors in L2 corpora additionally allows learners’ errors to be studied under the framework of computer-aided error analysis.
The ICLE has developed an error tagging system which uses purpose-built, menu-driven error editors. The Standard Speaking Test (SST) speech corpus has also adopted a machine learning technique to detect learners’ errors automatically (Izumi, Uchimoto & Isahara, 2000). Research can thus be conducted to investigate interlanguage errors in specific linguistic features, for example, connector usage in essays written by EFL learners (cf. Granger & Tyson, 1996).
Using the NCCU Learner Corpus, Chung and Tseng (2009) carried out a preliminary analysis of the preposition to, focusing on the collocations and senses of to used by language learners. The results showed that learners’ misuse of this preposition occurs only with lower-frequency words, indicating that learners learn the to-collocates in chunks. The errors were further analyzed, and possible reasons for them were proposed. Moreover, comparisons with other languages are also possible, as the NCCU Learner Corpus is a multilingual learner corpus. By analyzing learners’ language use, researchers and teachers can both benefit from uncovering the features of learner language and from revealing the difficulties learners encounter, which would provide
6. Conclusion
Learner corpora have received increasing attention in corpus building in recent years. This paper introduces a newly created learner corpus called the NCCU Foreign Language Learner Corpus. The construction of this corpus was facilitated by the wide variety of language courses available at NCCU.
At this stage, the NCCU Learner Corpus has been uploaded to an online interface, and some basic search functions have already been included. Additional training courses will be given on a continual basis to all members. The major difficulty faced by this project is the lack of professional programmers in languages other than English. Thus far, our project has accomplished the first stage of data collection, although the data are presented as raw data at the moment. These data are ready to be used for analyses despite the absence of annotations, which are expected to be added in the second phase of this project. In addition to keeping a comprehensive record of students’ learning processes and teachers’ pedagogical materials, the ultimate objective of this project is to encourage language educators to make further innovations in their pedagogical approaches, to investigate the possible reasons for learners’ language errors, to carry out research into linguistic and educational issues, and to provide a better understanding of language learning. Both language instructors and learners stand to benefit considerably from this project.
By using corpora, teachers can also investigate how students use certain vocabulary items in writing and discover how these items have been used incorrectly. This may in turn prompt teachers to make advances in research.
Furthermore, by working on this project together, teachers can observe how features of different languages may influence language learning among students who learn more than two languages at the same time. This is one of the characteristics of contrastive interlanguage analysis, in which cross-referencing is carried out for different languages.
At the college level, this project serves not only to unite language educators and linguists but also to encourage the exchange of teaching philosophies among teachers of different languages. Based on the abovementioned advantages, this paper has outlined the need for creating a foreign language learner corpus in the Taiwanese context.
Acknowledgements
We would like to acknowledge the funding from the NCCU Top-Universities Program, and to thank Professor Nai-ming Yu, the Dean of the College of Foreign Languages and Literature, and Professor Hsueh-ying Judy Yu, Vice-Dean of the College of Foreign Languages and Literature, for their support of this project. We would also like to thank all of the participating professors: Professors Wen-lang Soo, Yoshida Taeko and Su-chin Wang from the Department of Japanese, Professors Yun-sen Sung, Gui-ying Peng, Hsiang-lin Yeh and Pei-chi Chang from the Department of Slavic, Professor Feng-lan Luo from the Department of Arabic, Professors Chieh-tsung Chang and Byeong-seon Park from the Department of Korean, Professors Yao-chueh Juan, Katarzyna Stachura and Thomas Haquette from the Program in European Languages (French), Professor Simon Smith from the Foreign Language Center, and Professor Chao-lin Lui from the Department of Computer Science for contributing in different ways to this project. This paper documents the background to the development of the learner corpus and details how the NCCU Foreign Language Learner Corpus has been built under the supervision of the first author. Further research on the individual languages, as future outcomes of this project, is highly encouraged. Finally, we would also like to thank the research assistants Tzu-yu Liu, F.-Y. August Chao, Liang-Chun Jonathan Wang, Yi-Chen Joy Hsieh and Chun-Hung Chen for their continuous support.
References
Atwell, Eric, Peter Howarth, and Clive Souter. (2003). The ISLE Corpus: Italian and German spoken learners’ English. ICAME Journal, 27, 5-18.
Chen, Cheryl Wei-Yu. (2006). The use of conjunctive adverbials in the academic papers of advanced Taiwanese EFL learners. International Journal of Corpus Linguistics, 11(1), 113-130.
Chung, Siaw-Fong and Yu-Wen Tseng. (2009). Learning Prepositions: A Corpus-based Study in Taiwan EFL Contexts. Poster presented at the Third International Conference Grammar and Corpora. Mannheim, Germany. September 22-24.
Cheung, Hintat, Siaw-Fong Chung and Sophia Skoufaki. (2010). Indexing Second Language Vocabulary in the Intermediate GEPT. In the Proceedings of the Twelfth Academic Forum on English Language Testing in Asia (Language Testing in Asia: Continuity, Innovation and Synergy). The Language Training and Testing Center, Taiwan. March 5-6. pp. 118-136.
Corder, Stephen Pit. (1981). Error analysis and interlanguage. London; New York: Oxford University Press.
Díaz-Negrillo, Ana and Jesús Fernández-Domínguez. (2006). Error tagging systems for learner corpora. RESLA, 19, 83-102.
Fitzpatrick, Eileen and Steve Seegmiller. (2001). The Montclair electronic language
369-375). World Scientific.
Garside, Roger. (1987). The CLAWS Word-tagging System. In Roger Garside, Geoffrey Leech and Geoffrey Sampson (Eds.), The Computational Analysis of English: A Corpus-based Approach. London: Longman.
Garside, Roger and Nicholas Smith. (1997). A hybrid grammatical tagger: CLAWS4. In Roger Garside, Geoffrey Leech, and Anthony McEnery (Eds.), Corpus Annotation: Linguistic information from computer text corpora (pp. 102-121). New York: Addison Wesley Longman.
Gilquin, Gaëtanelle. (2001). The Integrated Contrastive Model. Spicing up your data.