Chapter 3 Method
3.3 Corpora Analysis
Using the corpora compiled and processed as described in the previous sections, analyses were conducted with the methods and tools introduced below. Keyword analysis (3.3.1) identified indicators of potentially interesting directions for investigation; attempts were then made to identify terminology equivalents and translation units on the basis of selected keywords (3.3.2); usage patterns associated with stylistic features of legal language were explored (3.3.3); and finally, additional observations were attempted by using the translational and non-translational corpora in conjunction (3.3.4). A flowchart of the analysis process is shown in Figure 3.6.
Figure 3.6. Flowchart of analysis process and tools
3.3.1 Keyword Analysis
To gain a quick overview of possible indicators of proper nouns, theme, and style, keyword analysis (Scott, 2000) was conducted separately on the processed English and translational corpora. The aim was for points of interest to emerge from the automatic comparison of frequency data (Flowerdew, 2012) and so guide subsequent analyses. The reference corpus chosen for this study is the Brown Corpus (Francis & Kucera, 1964), a general corpus of approximately 1 million English words (features summarized in Table 3.2).
Despite its limitations in size and time coverage, the Brown Corpus is the most accessible and practical choice for non-academic users when compared with other general English corpora. The corpus is available in full text, enabling the manipulation needed to suit different studies and corpus-based approaches. To facilitate comparison between the specialized and reference corpora, the Brown Corpus was re-tagged with the Stanford Tagger and the Penn Treebank tagset before the analysis process in this study.
Table 3.2
Specifics of the Brown Corpus
Brown Corpus
Content: Non-translational English texts of United States press reportage, editorial, and reviews; religion; skills and hobbies; popular lore; belles-lettres; government and house organs; academic knowledge; fiction (general, mystery, science, adventure, and romance); and humor
Size: 1 million words
Representativeness: General English of the United States
Publication Time: Jul. 1958-Jan. 1962
Keyword analysis was performed with the keyword list tool of AntConc 3.4.3 (Anthony, 2014a), a freeware program that incorporates a number of tools for conducting corpus-based research (Anthony, 2014b). Settings were adjusted so that the underscore “_” and the tag were treated as part of each word. Keyword lists were generated separately for the English and translational corpora based on the log-likelihood ratio (LLR), the default and recommended significance test for calculating “keyness,” or keyword strength (Anthony, 2014b; Dunning, 1993; Rayson et al., 2004). The user interface of the AntConc keyword list tool is shown in Figure 3.7.
After excluding results containing numbers, punctuation, and other symbols, keywords with keyness below the critical value of 15.13 were also omitted, retaining only keywords whose frequency difference between the legal corpora and the reference corpus is significant at the 99.99% confidence level (p < 0.0001) (Rayson & Garside, 2000).
The frequency threshold of keywords was set at 3 occurrences, adopting the criterion recommended in Scott and Tribble (2006). Keywords that occur exclusively in the translational corpora were also identified by comparing the translational keyword list against the frequency list of the English corpora.
Figure 3.7. Screenshot of AntConc keyword list tool. The interface shows the keyword list generated for the English corpora.
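For illustration only, the keyness calculation and cut-offs described above can be sketched in Python. The sketch uses the two-corpus log-likelihood formulation associated with Rayson and Garside (2000); it is not AntConc's own code. The function names, the restriction to words that are relatively more frequent in the study corpus, and the sample counts are assumptions introduced here.

```python
import math

CRITICAL_VALUE = 15.13  # keyness cut-off cited in the text
MIN_FREQUENCY = 3       # frequency threshold cited in the text


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood score for one word observed in two corpora."""
    total = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def is_keyword(freq_study, size_study, freq_ref, size_ref):
    """Apply the frequency threshold and the critical value."""
    if freq_study < MIN_FREQUENCY:
        return False
    # Assumption: only words over-represented in the study corpus are kept.
    if freq_study / size_study <= freq_ref / size_ref:
        return False
    score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    return score >= CRITICAL_VALUE


# Hypothetical counts: 120 hits in a 500,000-word legal corpus versus
# 10 hits in a 1,000,000-word reference corpus.
print(is_keyword(120, 500_000, 10, 1_000_000))
```

The statistic itself does not indicate direction, which is why the sketch compares relative frequencies before applying the critical value.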
Part-of-speech distributions of the keywords were calculated by totaling the token frequencies of the keywords under each part of speech. For easier observation, POS tags were grouped into broad categories: nouns, verbs, adjectives, adverbs, prepositions, determiners, conjunctions, modals, pronouns, and foreign words. POS distributions of translational keywords and English keywords were calculated separately. Part-of-speech distribution was also tallied for keywords that occur exclusively in the translational corpora.
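As a rough illustration, the tally described above could be sketched as follows. The mapping from Penn Treebank tag prefixes to broad categories, the word_TAG input format, and the sample frequencies are assumptions made for this sketch; the study's actual grouping of tags may differ.

```python
from collections import Counter

# Assumed mapping from Penn Treebank tag prefixes to broad categories.
COARSE_CATEGORIES = [
    ("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"), ("RB", "adverb"),
    ("IN", "preposition"), ("DT", "determiner"), ("CC", "conjunction"),
    ("MD", "modal"), ("PRP", "pronoun"), ("FW", "foreign word"),
]


def coarse_category(tag):
    """Return the broad category for a tag, or 'other' if unmapped."""
    for prefix, category in COARSE_CATEGORIES:
        if tag.upper().startswith(prefix):
            return category
    return "other"


def pos_distribution(keyword_freqs):
    """Total token frequencies per category and return proportions.

    keyword_freqs maps 'word_TAG' strings to token frequencies.
    """
    counts = Counter()
    for item, freq in keyword_freqs.items():
        _, _, tag = item.rpartition("_")  # tag follows the last underscore
        counts[coarse_category(tag)] += freq
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical keyword frequencies.
sample = {"shall_MD": 300, "party_NN": 150, "parties_NNS": 50}
print(pos_distribution(sample))
```

Running the same function on the translational keywords, the English keywords, and the translational-only keywords would give the three separate distributions mentioned in the text.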
The keyword lists of the English and translational corpora, particularly the top-ranking keywords, together with the keyword part-of-speech distributions, were used to make preliminary observations about what translators are likely to encounter when working with legal texts. Selected keywords of particular interest were then explored through other statistical techniques and with computational linguistics tools so as to address the research questions proposed in Chapter 1.
3.3.2 Terminology Equivalents and Translation Units
As suggested in previous studies, parallel corpora are an effective tool for identifying terminology equivalents (Bowker & Pearson, 2002; Toyama, 2011), with word alignments being even more advantageous than sentence alignments in the case of bilingual concordance programs (Gale & Church, 1991a). Because nouns and adjectives are more likely indicators of terms than words of other categories (Voutilainen, 2003), this study selected keywords in the noun and adjective categories for identifying terminology equivalents and associated translation units from the phrasal alignment and sentence alignment results.
Two types of keywords were selected for searches of terminology equivalents. Indicators of theme or “aboutness” were selected from content words among the high-frequency keywords of the translational corpora. Indicators of proper nouns were obtained from the list of translational keywords that do not occur at all in the English corpora.
With the exception of phrasal alignment search on entries containing keywords specific to the translational corpora, bilingual searches in either phrasal alignment results or sentence-aligned parallel corpora were conducted in CUC_ParaConc V0.3 (N.
Cheng, 2013), a screenshot of which is shown in Figure 3.8. CUC_ParaConc is a parallel-corpus retrieval program that accepts parallel corpora aligned at any level as supplied by the user, and supports bilingual and multilingual search functions with monolingual or multilingual search words (Cheng & Hou, 2012).
Figure 3.8. Screenshot of CUC_ParaConc bilingual search and retrieval interface. The results shown are those of a phrasal alignment search.
Due to the number of translational keywords absent from the English corpora, as well as limitations of corpus size and software capacity, the batch search for the proper noun indicators was handled with the sed command in the Cygwin environment. The search items were listed with the print command “p” and processed with the “-n” option, which prevents sed from outputting lines unless a “print” request is supplied. Matching results containing the specified search items were then copied to a designated output file.
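The effect of this batch search can be approximated in Python, shown here only as a sketch and not as the command actually run in the study. It keeps every line that contains at least one search item, much as sed run with the “-n” option and one “p” command per item would. The function name and the sample alignment lines are hypothetical. Unlike sed, the sketch treats search items as literal strings and returns each matching line once.

```python
import re


def batch_search(lines, search_items):
    """Return the lines that contain at least one of the search items."""
    if not search_items:
        return []
    # Escape each item so it is matched literally, then join as alternatives.
    pattern = re.compile("|".join(re.escape(item) for item in search_items))
    return [line for line in lines if pattern.search(line)]


# Hypothetical phrasal alignment entries (the format is illustrative only).
alignments = [
    "src_1 ||| the Ministry of Justice",
    "src_2 ||| a written contract",
    "src_3 ||| Ministry of Finance",
]

for match in batch_search(alignments, ["Ministry"]):
    print(match)
```

In practice the matching lines would be written to an output file rather than printed, mirroring the designated output file mentioned above.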
Possible terminology equivalents obtained from phrasal alignment search, either through CUC_ParaConc or the sed string-matching function, were then studied and selected in terms of their correctness or usability. Using partially correct alignments or abbreviations as search words, searches were attempted to identify the full corresponding equivalent or proper noun through sentence-based bilingual concordance. Also, because phrasal alignment results generated by pialign can range in length from a single token to several tokens, some of the results include other co-occurring words and extended collocations.
Once amended in the same way, these search results provide lengthier translation units associated with key terminology that are readily usable.
3.3.3 Exploring Stylistic Features
As pointed out in Scott (2000), indicators of style identified through the keyword approach often appear to be function words with unusually high frequencies, and are therefore unlikely to be ideal candidates for identifying terminology equivalents. However, translators require more than bilingual dictionaries to complete their jobs, and some of the ways in which corpora can serve as useful reference tools include informing writing style and idiomatic usages (Bowker & Pearson, 2002).
Phraseology, as Stubbs (2001) pointed out, is an important subject of linguistics, and corpora can facilitate the study of these recurring, multi-word phrasal units of natural-sounding language use. This study therefore investigated n-grams and concordances associated with style indicators as an approach to exploring useful collocation, colligation, and other usage patterns in legal English.
N-grams and monolingual concordance were studied with the aid of the AntConc n-gram/cluster tool and concordance tool, respectively. POS-tagged versions of the English corpora were used for studying n-grams to better observe colligation patterns and POS sequences where relevant. To exclude n-grams spanning different sentences or sentence parts, the corpora were processed so that line breaks were inserted after punctuation marks. During analysis, the line break replacement option was disabled in the settings of the AntConc n-gram/cluster tool. The number of texts containing each n-gram is also provided by the tool, helping to eliminate results that may be specific to only certain topics or authors (law drafters, translators). A possible starting point for analyses is n-grams of three or more sequential words, occurring at least 20 times per million words across five or more different texts, as recommended for lexical bundles in Biber and Conrad (1999) and Biber, Conrad, and Cortes (2004).
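The lexical bundle criteria above can be illustrated with a short Python sketch. This is not the AntConc implementation. It assumes that each text has already been split into lines at punctuation marks and tokenized on whitespace, and the function name, input structure, and sample texts are introduced here for illustration.

```python
from collections import Counter, defaultdict


def lexical_bundles(texts, n=3, min_per_million=20, min_texts=5):
    """Find n-grams meeting the frequency and dispersion thresholds.

    texts maps a text identifier to a list of lines; n-grams are counted
    within each line only, so none spans a line break.
    """
    freq = Counter()
    spread = defaultdict(set)  # n-gram -> identifiers of texts containing it
    total_tokens = 0
    for text_id, lines in texts.items():
        for line in lines:
            tokens = line.split()
            total_tokens += len(tokens)
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                spread[gram].add(text_id)
    if total_tokens == 0:
        return {}
    bundles = {}
    for gram, count in freq.items():
        per_million = count * 1_000_000 / total_tokens
        if per_million >= min_per_million and len(spread[gram]) >= min_texts:
            bundles[" ".join(gram)] = (count, len(spread[gram]))
    return bundles


# Hypothetical corpus: five texts share a phrase, a sixth does not.
texts = {f"text_{i}": ["in accordance with the law", "the law"] for i in range(5)}
texts["text_5"] = ["subject to approval"]
print(lexical_bundles(texts))
```

Each returned value pairs the raw frequency with the number of texts containing the n-gram, which corresponds to the dispersion check described above.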
Monolingual English concordances obtained with the AntConc concordance tool were sampled with the method provided in Sinclair (2003), aiming to extract samples evenly distributed over all texts in the corpora. A batch of 25 samples was taken for each object of study; the first sample was selected at random from among the first 4% of all generated concordance lines, and each subsequent sample was selected automatically after skipping 4% of the concordance hits since the previously selected instance. The 4% gap between concordance samples was calculated for each searched item by dividing the number of all hits found by 25.
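The sampling procedure described above amounts to systematic sampling with a random start, which can be sketched as follows. The function name and the optional random generator argument are assumptions made for this illustration; the study applied the procedure to concordance output rather than running this code.

```python
import random


def systematic_sample(hits, sample_size=25, rng=None):
    """Pick evenly spaced concordance hits, starting at a random offset."""
    rng = rng or random.Random()
    if len(hits) <= sample_size:
        return list(hits)
    step = len(hits) / sample_size  # the 4% gap when sample_size is 25
    start = rng.random() * step     # random point within the first gap
    last = len(hits) - 1
    # Clamp the index as a safeguard against floating-point rounding.
    return [hits[min(int(start + k * step), last)] for k in range(sample_size)]


# Hypothetical concordance of 1,000 hits, represented by their positions.
concordance = list(range(1000))
print(systematic_sample(concordance, rng=random.Random(0)))
```

With 1,000 hits the gap is 40 lines, so the 25 sampled lines fall 40 hits apart and cover the whole concordance.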
Analyses were then attempted following the instructions of Sinclair (2003) and Hunston and Francis (2000). The sampled concordances were observed for conspicuous patterns that surface on either side of the queried keyword. Endeavors were made to formulate hypotheses on usage patterns associated with the style indicator in question, taking into account the part of speech of the combination of words as well as the meaning of the keyword. A summary was then attempted to describe the idiomatic usage of the identified patterns in a legal context.
The above sampling method was also applied to bilingual concordance lines from the parallel corpora, which were extracted with CUC_ParaConc instead and studied in a similar fashion for translation strategies associated with the selected style indicators. Revealing strategies that previous translators have used to overcome constraints imposed by the source texts is an important function that parallel corpora serve in the translator training environment, one that supplements the features of comparable corpora (Bernardini et al., 2003; Pearson, 2003).
3.3.4 Utilizing Translational and Non-translational Corpora
Statistical machine translation systems rely not only on a translation model for identifying the appropriate word sets translated from the input text, but also on a language model which ascertains the correct word sequence in the target language (Somers, 2003). Similarly, translators can turn to parallel corpora for identifying terminological equivalents (Bowker & Pearson, 2002), but monolingual corpora in the target language are often found useful for providing information on “native-like” means of expression (Bernardini et al., 2003). It follows that parallel and monolingual corpora can be used in combination to provide more comprehensive information to the translator.
As proposed in Baker (1995) and confirmed in Laviosa (1998), comparison of translational corpora and non-translational corpora will reveal patterns specific to translated texts. In addition to the keyword lists (as described in Subsection 3.3.1), therefore, terminology equivalents, translation units, and style-related usage patterns identified were also used in this study as starting points of comparison between the translational and English corpora. By conducting n-gram and concordance searches associated with the identified equivalents and patterns, efforts were made to identify additional phrase-like units and information that can aid the process of legal translation.
In the case of terminology equivalents and translation units, comparisons of their frequencies in the translational and English corpora will help verify which of the corresponding word sets are likely preferred or more common in legal English, while possibly uncovering similar usable phrase-like units. Concordance searches can further ascertain the contexts in which these word sets are used and whether or not these contexts are similar to one another or associated with specific phrasal units.
Comparison between bilingual and English concordance lines containing the same patterns, whether terminology or style related, will help determine if certain translation strategies are preferred over others when aiming to achieve idiomatic usage appropriate in legal English. It is also possible that additional translation strategies will be deducible from comparable English concordance results that are not apparent by simply observing the translational corpora.
Starting from an initial keyword analysis, therefore, the above methods were used to identify and explore terminology equivalents, translation units, style-related patterns, other phraseology features, and translation strategies from within the parallel and non-translational corpora, with the aim of summarizing useful information and providing insights to legal translators. The results obtained from this process will be presented and discussed in the next chapter.