Siaw-Fong Chung, Shu-Yi Wang, Yu-Wen Tseng
1. Introduction
A learner corpus usually refers to a collection of written and/or spoken texts produced by foreign or second language learners. Such corpora document learners' production of a target language verbatim, and specific features such as errors or non-standard characteristics in the learners' language are treated as interlanguage (Selinker, 1972; Corder, 1981) between the mother tongue and the target language. The most commonly used methodology for analyzing a learner corpus is contrastive interlanguage analysis (CIA) (cf. Gilquin, 2001; Granger, 1996), a method in which features in a learner corpus are compared with those in a reference corpus based on native speaker data. When the contents of the two corpora are compared, the presence or absence of certain features is taken as a specific characteristic of the learners' learning process.
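The comparison at the core of CIA can be illustrated with a minimal sketch in Python. The two token lists are invented toy data, and the function names `relative_frequency` and `compare_feature` are hypothetical helpers introduced only for illustration; they are not taken from any of the corpora or tools discussed in this paper. The sketch only contrasts normalized frequencies of one feature, whereas an actual study would normally also apply a significance test.

```python
from collections import Counter


def relative_frequency(tokens, feature, per=1000):
    """Return the number of occurrences of `feature` per `per` tokens."""
    if not tokens:
        return 0.0
    return Counter(tokens)[feature] * per / len(tokens)


def compare_feature(learner_tokens, reference_tokens, feature):
    """Contrast a feature's normalized frequency in the two corpora."""
    learner = relative_frequency(learner_tokens, feature)
    reference = relative_frequency(reference_tokens, feature)
    if learner > reference:
        label = "overuse"
    elif learner < reference:
        label = "underuse"
    else:
        label = "similar"
    return learner, reference, label


# Toy data (assumption): stand-ins for a learner corpus and a reference corpus.
learner_tokens = "i think that it is very very good and i think it is nice".split()
reference_tokens = "it seems that the result is good and the method is sound".split()

result = compare_feature(learner_tokens, reference_tokens, "very")
print(result)
```

Normalizing by corpus size matters here because learner and reference corpora rarely contain the same number of tokens, so raw counts alone would not be comparable.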
To date, many learner corpora of English have been created, and these corpora include English data from foreign or second language learners of various backgrounds. The International Corpus of Learner English (ICLE) (Granger et al., 2009) is an established learner corpus documenting learners of different mother tongue backgrounds in Europe. ICLE also collected data written by Chinese learners studying in Europe. As for learner corpora based on texts produced by Chinese learners, the Spoken and Written English Corpus of Chinese Learners, or SWECCL (Wen, Wang, & Liang, 2005 & 2007), from China is a collection of test materials based on the English produced by Chinese learners of English in China. A recent Taiwan-based learner corpus of English has been collected by the Language Testing and Training Center (LTTC) based on texts produced by examinees taking the General English Proficiency Test (GEPT) (cf. Cheung, Chung & Skoufaki, 2010). These corpora are all based on texts produced by learners of English.
There are few learner corpora consisting of texts produced by learners of other foreign languages, and even fewer that comprise several foreign languages within a single corpus. CPATEI (Spanish-English Learners Written Parallel Corpus) (Lu & Lu, 2009) is a project in Taiwan which collects learner data in Spanish produced by Taiwanese learners.1 The same project also collected data from texts in Japanese, German and Chinese written by Taiwanese learners. Another project is the International Corpus of Crosslinguistic Interlanguage (ICCI).2 The ICCI project aims at collecting data both from learners of English and from learners of different foreign languages in countries such as Austria, China (Hong Kong), Israel, Poland, Singapore, Spain and Taiwan. The Taiwan data in the ICCI project come mainly from students studying foreign languages at the LTTC. These projects have a similar aim: to collect data from learners of various mother tongue backgrounds who are learning different target languages. Figure 1 summarizes the three main types of learner corpora.
1 http://corpora.flld.ncku.edu.tw/
2 http://cblle.tufs.ac.jp/llc/icci/
Figure 1: Types of Learner Corpora
Type A. Target language: English. Mother tongue: different languages (e.g., French, German, Japanese, Spanish, Mandarin, etc.).
Type B. Target language: different languages (e.g., French, German, Japanese, Spanish, English, etc.). Mother tongue: Mandarin.
Type C. Target language: different languages (e.g., French, German, Japanese, Spanish, English, etc.). Mother tongue: different languages (e.g., French, German, Japanese, Spanish, English, etc.).
In Figure 1 above, most of the existing learner corpora fall under type 'A', with English as the target language produced by learners from various language backgrounds. Type 'B' is a different kind of learner corpus because it targets only one type of learner: those whose mother tongue is Mandarin Chinese. In type 'A', learners whose mother tongue is Mandarin are only one of many language backgrounds represented. In type 'B', learners who speak Mandarin Chinese as their mother tongue constitute the only language background, while the target languages are many, including English, which, in contrast, is the only target language in type 'A'. In type 'C', language data are produced by learners of different language backgrounds who are learning different target languages.
In this paper, we will detail the construction of a learner corpus based on learners at National Chengchi University (NCCU) who are learning different languages, i.e., type 'B' in Figure 1. This newly created learner corpus is called the NCCU Foreign Language Learner Corpus (hereafter NCCU Learner Corpus), which is funded by the College of
languages by collecting NCCU learners' written texts in both soft and hard copies. In terms of data collection, the College of Foreign Languages and Literature in NCCU is privileged in the sense that it includes language courses taught in twenty-three different languages. Therefore, in terms of learning environment, NCCU provides a good resource for collecting data from Taiwanese learners of various foreign languages. Since learners of various target languages can be found in NCCU, a learner corpus built from these languages will benefit research in the fields of second language teaching and language pedagogy.
The above are some of the motivations behind the establishment of a foreign language learner corpus in NCCU. The overall aim is to enhance the quality of language education and to boost research using locally based data.
As of the second semester of the 2008 academic year, sixteen professors were participating in this project, with expertise in the following languages: English, French, Japanese, Korean, Russian, and Arabic. At this stage, only written assignments have been collected for these languages. Spoken data will be considered only at a later stage in the development of the learner corpus.
In this paper, we introduce the features of the NCCU Learner Corpus and document how this corpus took shape. We review some learner corpora and discuss the steps necessary to create our learner corpus, all of which are crucial information for the construction of a learner corpus. In addition, we outline the future prospects of this learner corpus and discuss its applications.
In the section below, we first review two of the learner corpora that we have mentioned previously – the ICLE and the SWECCL.
2. Learner Corpora in Use: ICLE and SWECCL
Learner corpora in English take various forms. SWECCL 1.0 and 2.0 (Wen et al., 2005 & 2007), two versions of the Spoken (SECCL) and Written (WECCL) English Corpus of Chinese Learners created in China, were launched between 1996 and 2007. The SWECCL project team collected recorded audio files from the Test for English Majors (TEM) for the SECCL and English learners' writings in China for the WECCL. The steps from data collection to the establishment of the SWECCL corpus are summarized by the authors of this work in Figure 2 below.
First, the project team decided to collect data from the TEM and writing assignments from college students. After collecting all the data, the team calculated the volume of data and made duplicates for filing. Once the data were all classified, the team started the typing work, which included pre-training and assigning work to typists. The typists submitted the digitized data for electronic storage. The team then conducted a comprehensive review of the digitized data. To ensure that all the data were valid, spot checks were conducted after two separate comprehensive reviews, and then the metadata and tags were added for storage in the corpus.
Figure 2. Flow Chart of the Establishment of the SWECCL
Each of the two versions of SWECCL contains around 2,000,000 tokens, and all data were tagged. The SWECCL used CLAWS4 (Garside, 1987; Garside & Smith, 1997), a grammatical tagging system developed by Geoffrey Leech, Roger Garside and Michael Bryant at Lancaster University in the United Kingdom, as its part-of-speech (POS) tagging system, and the corpus was also lemmatized and error-tagged. In the SECCL (spoken), grammatical errors, mispronunciation, disfluency, self-repetition and pause fillers were also tagged. Example (1) below provides a tagged sample of spoken errors in SECCL (Wen, Wang, & Liang, 2005: 27-29).
(1) Grammatical error: has <had>