Chapter 2 Literature Review
2.2 Corpora as Translation Reference Tool
Recent years have seen the application of corpora expand beyond their well-established use as a basis for compiling dictionaries and grammar books into several new fields (Flowerdew, 2012). Amid rapid technological development and industry change, corpora and corpus linguistics have become increasingly important in several areas of translation, including corpus-based translation studies, corpora as reference sources in translation practice, computer-aided translation technology, and corpora as teaching or learning aids in translator education (Bernardini et al., 2003; Chen, 2012). The following subsections focus on the categorization of corpora and their usage in translation practice (2.2.1) and on the design of specialized corpora for aiding the translation process (2.2.2).
2.2.1 Corpus Typology and Usage
Corpora can be categorized in many ways according to their content: by mode (written or spoken), by subject coverage (general or specialized), by the time periods they cover, by the languages included, and by whether and how the texts have been processed (Lee, 2010).
The types of corpora most commonly referred to with regard to usage in translation likely include the following three categories: monolingual corpora, comparable bilingual corpora, and parallel corpora (Bernardini et al., 2003).
Monolingual corpora are usually mentioned with regard to the target language and are useful for providing information on “native-like” means of expression to the translator.
Comparable corpora are composed of two or more subsets of non-translational texts in different languages, selected according to analogous design criteria. They provide linguistic as well as cultural information, typically within the same subject domain, on both the source and target languages for reference and comparison. Parallel corpora, meanwhile, consist of original, or non-translational, texts in a source language and their translations into a target language. They allow users to observe what strategies translators have used to overcome constraints imposed by the source texts in the process of translation.
The application of parallel corpora to investigating translation strategies was demonstrated in Pearson (2003). By examining a small set of culture-specific references in popular science articles, participants in the study were able to observe the strategies translators used in dealing with situationally constrained expressions. The study thus supported an important role for parallel corpora in the translator training environment, where they can serve a function distinct from, and supplementary to, that of comparable corpora.
Bowker and Pearson (2002) summarized a number of ways to investigate a corpus with computerized tools in order to obtain the kinds of information described above, which translators frequently seek. Monolingual corpora in the target language are useful reference tools for verifying information, such as whether a candidate terminological equivalent is correct, whether a certain collocation is appropriate, and whether a usage or pattern is idiomatic. They can also provide information on writing style and conceptual explanations, or even be used to identify translation equivalents. For example, users might conduct a context search, narrowing search results to those in which one pattern occurs in the vicinity of another, or generate a list of word clusters containing a certain pattern, in order to find equivalents previously unknown to them.
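The context search and word-cluster queries mentioned above can be illustrated with a minimal Python sketch. It is not the interface of any particular corpus tool: the function names, the window size, and the three-sentence sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def context_search(tokens, node, collocate, window=4):
    """Return the positions where `node` occurs with `collocate`
    within `window` tokens on either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        nearby = tokens[start:i] + tokens[i + 1:i + 1 + window]
        if collocate in nearby:
            hits.append(i)
    return hits


def clusters(tokens, node, size=3):
    """Count every `size`-word cluster that contains `node`."""
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        gram = tuple(tokens[i:i + size])
        if node in gram:
            counts[gram] += 1
    return counts


# Hypothetical sample text standing in for a target-language corpus.
sample = (
    "The parallel corpus was aligned at sentence level. "
    "A parallel corpus pairs source texts with translations. "
    "Each comparable corpus was sampled separately."
)
tokens = tokenize(sample)
print(context_search(tokens, "corpus", "aligned"))
print(clusters(tokens, "corpus", size=2).most_common(3))
```

Sorting the clusters by frequency is what lets a translator notice recurrent phrasings around a search word, including candidate equivalents that were not known in advance.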
Parallel corpora, aside from enabling investigation of translation strategies, can provide information on term usage, collocation, and writing style in translated texts, and can even be employed for identifying terminological equivalents (Bowker & Pearson, 2002). In fact, Toyama (2011) stated that parallel corpora can be viewed as bilingual dictionaries in this respect. When integrated with computer-aided translation (CAT) tools, parallel corpora can provide additional and readily usable material for constructing translation memory (TM) segments (Quah, 2006).
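The dictionary-like use of a parallel corpus can be sketched as a lookup over aligned segment pairs. The English-French segments below are invented for illustration, and the function name `lookup` is hypothetical. Real CAT tools store TM segments in their own formats and match them far more flexibly than this simple substring test.

```python
# Hypothetical, hand-made aligned segments (English source, French target).
aligned_segments = [
    ("The corpus was compiled in 2001.", "Le corpus a été compilé en 2001."),
    ("Each text was aligned by sentence.", "Chaque texte a été aligné par phrase."),
    ("The corpus contains fiction only.", "Le corpus ne contient que de la fiction."),
]


def lookup(term, segments):
    """Return every (source, target) pair whose source segment contains `term`.

    Matching is a case-insensitive substring test, so it also matches
    inside longer words; this is a simplification.
    """
    term = term.lower()
    return [(src, tgt) for src, tgt in segments if term in src.lower()]


for src, tgt in lookup("corpus", aligned_segments):
    print(src, "=>", tgt)
```

Reading the target side of each hit shows how a term was actually rendered in context, which is the sense in which the corpus behaves like a bilingual dictionary.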
Comparable corpora, which comprise two or more sets of monolingual corpora on subjects in the same domain, can serve the same functions as monolingual corpora when used as a reference tool for translation (Pearson, 2003). Scholars have also maintained that, compared with parallel corpora, comparable corpora have the advantage of being easier to compile and of higher quality because more monolingual texts are available, though establishing categories and sampling procedures may pose difficulties (Maia, 2003).
The term comparable corpora as defined above differs from its use in Baker (1995), who referred to this type of corpora as multilingual corpora in the context of descriptive translation studies. Baker instead reserved the term comparable corpora for corpora consisting of original texts written in a certain language and texts translated into the same language. Such a corpus would in effect combine a monolingual corpus and a translational corpus of similar design. Baker proposed using this type of comparable corpus to identify patterns specific to translated texts.
This definition of comparable corpora and the associated line of research were adopted in Laviosa (1998), who compiled an English Comparable Corpus comprising a monolingual (non-translational) subset and a translational component of newspaper articles and narrative prose. The study found four major patterns in the lexical use of translational English texts as compared with the non-translational texts in the corpus: lower lexical density (the percentage of content words as against function words), a higher proportion of high-frequency words, more repetition among the most frequent words, and fewer lemmas.
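To make the notion of lexical density concrete, the sketch below computes one common formulation: the share of content words among all running words, expressed as a percentage. This may differ from the exact measure used in the cited study. The function-word list is a deliberately tiny, hypothetical stand-in, and the two sample sentences are invented.

```python
import re

# Hypothetical, deliberately small function-word list. A real study would
# rely on a fuller list or on part-of-speech tagging.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or",
    "is", "was", "were", "it", "that", "by", "with", "for",
}


def lexical_density(text):
    """Return content words as a percentage of all word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return 100.0 * len(content) / len(tokens)


# Invented sentences, used only to show the calculation.
compact = "Translators consult specialized corpora daily."
wordy = "It is the case that the corpora were consulted by the translators."

print(round(lexical_density(compact), 1))
print(round(lexical_density(wordy), 1))
```

A text that relies more heavily on function words scores lower on this measure, which is the direction of difference the study reports for translational texts.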
2.2.2 Designing of Specialized Corpora
Bowker and Pearson (2002) recommended the use of corpora as a reference tool, in the ways described above, to translators, who are often required to learn the language used for communicating in the specialized subject field in which they are working. A language variety of this type is termed a “language for special purposes” (LSP), and can be acquired more effectively by consulting a special purpose corpus, which presents a particular aspect of a language, such as the LSP of a particular subject field, a specific text type, or a particular language variety. Specialized corpora can be a valuable complement to other reference sources, especially since dictionaries, printed texts, and other conventional LSP-learning materials may be limited by incompleteness, physical volume, time requirements, or unavailability.
In addition to the intended purpose and the languages and subject domains to include, there are a number of issues to consider when designing a corpus, including size, full or excerpted texts, authorship, text format, and even copyright. Corpora can consist of written or spoken language; they can be synchronic, meaning that they represent language use within a limited time frame, or diachronic, facilitating studies of how the language has evolved over time; and they can be constantly expanded and changed (open) or of a finite size (closed). McEnery and Wilson (2001) pointed out that a corpus must be representative of a language variety; other generally recommended criteria include a reasonably large size, full texts rather than excerpts, texts by a variety of authors, and electronic format. Nevertheless, corpora ranging in size from thousands to hundreds of thousands of words have all proved effective for LSP studies (Bowker & Pearson, 2002).
A very specialized type of corpus was explored in Varantola (2003), namely the disposable (ad hoc) corpus collected for the needs of a single translation assignment. From the Finnish-English and English-Finnish translation assignments completed by workshop participants, it was observed that corpora benefited the translation projects by providing reassurance for strategic and lexical decisions, especially in cases of radical decisions to break from the source material. Some participants also found corpus evidence to support choices of register, because the target audience for their particular assignment had been taken into account at the corpus compilation stage. However, participants in the study also questioned the cost-efficiency of corpus compilation, as the undertaking had proved difficult for reasons including the accessibility and reliability of many materials.
At the other end of the spectrum is perhaps the bi-directional Portuguese-English parallel corpus Compara (http://www.portugues.mct.pt/Compara/), whose design does not address issues of corpus balance and representativeness (Frankenberg-Garcia & Santos, 2003). The corpus is open-ended, with no pre-determined rules as to what variety of texts can be included. The texts initially included were fiction, because texts of other genres were uncommon, unavailable in one of the two language directions, questionable in quality, or often relayed (translated into Portuguese or English from a third language). However, users are given the option of narrowing searches by language variety, subject, publication date, author, or translator, which effectively allows them to work with tailored sub-corpora suited to the task at hand.
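The idea of carving a tailored sub-corpus out of an open-ended corpus can be sketched as a filter over text metadata. This is not Compara's actual search interface: the record fields, the placeholder titles and authors, and the `sub_corpus` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Metadata for one text in the corpus (hypothetical fields)."""
    title: str
    language_variety: str
    year: int
    author: str


# Placeholder records, not real corpus contents.
records = [
    TextRecord("Text A", "Brazilian Portuguese", 1899, "Author 1"),
    TextRecord("Text B", "European Portuguese", 1888, "Author 2"),
    TextRecord("Text C", "European Portuguese", 1995, "Author 2"),
]


def sub_corpus(records, **criteria):
    """Keep only the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


selection = sub_corpus(records, language_variety="European Portuguese", year=1995)
print([r.title for r in selection])
```

Because the selection happens at query time, the corpus itself can remain open-ended while each user still works with a subset matched to a specific task.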