Computational Linguistics and Corpus Processing

Chapter 2 Literature Review

2.3 Computational Linguistics and Corpus Processing

Computational linguistics was defined in Mitkov (2003) as the field of study “concerned with the processing of language by computers” (p. x). Many technological applications today rely on computational linguistics techniques: machine translation, information retrieval, speech recognition, and text data mining are just a few of the numerous examples.

Corpus data have played an essential role in the development and evaluation of many natural language processing applications (McEnery, 2003). At the same time, corpus linguistics has also benefited from the incorporation of these increasingly sophisticated language processing programs. The subsections below introduce the technologies supporting the corpus processing programs this study makes use of: part-of-speech tagging (2.3.1), statistical machine translation (2.3.2), sentence alignment (2.3.3), and word and phrasal alignment (2.3.4).

2.3.1 Part-of-Speech Tagging

Part-of-speech (POS) tagging refers to the automatic assignment of grammatical tags, which are attached by computer programs to indicate the POS category of input words (Voutilainen, 2003). POS tagging is perhaps the most common type of annotation, that is, the “addition of explicit linguistic information” to corpus texts (Bowker & Pearson, 2002, p. 229).

McEnery (2003) summarized four key advantages of annotating a corpus. Firstly, annotation increases the ease of corpus exploitation by making the results of corpus analyses available to machines as well as to human users unfamiliar with the language. Users capable of performing the analyses can also save time by obtaining the information directly from the annotations. Secondly, annotation allows the results of analyses to be recorded for reuse, without unnecessary repetition of the analyses. Thirdly, annotation enables corpus analyses to serve multiple functions, including purposes for which the analyses were not originally intended. Finally, annotation makes explicit the interpretations performed, and by opening them to scrutiny enables them to stand more objectively than unrecorded interpretations.

POS tagging provides information for addressing a number of linguistic issues. As pointed out in Reppen (2010), many words have multiple meanings that belong to different word categories and cannot be distinguished by spelling alone. Working with a POS-tagged corpus allows users to disambiguate such polysemous words in frequency lists and other empirical results. Users can therefore focus on a specific word class, or filter out irrelevant search results. For example, if a researcher wishes to study the high-frequency verbs in a specialized field, or narrow down search results to the modal “can” instead of including its other POS forms, they can do so relatively easily by exploiting POS tags (Bowker & Pearson, 2002; Reppen, 2010). In term extraction applications, POS tags can improve the identification of terminology candidates for automatic retrieval, based on the knowledge that nouns and adjectives are more likely indicators of terms than words of other categories (Voutilainen, 2003).
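
The filtering described above can be illustrated with a minimal sketch. The toy corpus is hand-tagged, and the tag names (MD for modal, NN for noun, VB for verb) follow Penn Treebank conventions; both are assumptions of this example rather than details taken from the sources cited.

```python
from collections import Counter

# Hand-tagged toy corpus (invented). Tag names follow Penn Treebank
# conventions, which is an assumption of this sketch.
tagged = [
    ("We", "PRP"), ("can", "MD"), ("open", "VB"), ("the", "DT"),
    ("can", "NN"), ("and", "CC"), ("can", "VB"), ("the", "DT"),
    ("fruit", "NN"), (".", "."),
]

def hits(tagged_tokens, word, tag=None):
    """Return the positions of `word`, optionally restricted to one POS tag."""
    return [i for i, (w, t) in enumerate(tagged_tokens)
            if w.lower() == word and (tag is None or t == tag)]

# A plain string search returns every occurrence of "can" ...
all_can = hits(tagged, "can")
# ... whereas a POS-aware search keeps only the modal.
modal_can = hits(tagged, "can", tag="MD")
# A frequency list restricted to one word class (verbs).
verb_freq = Counter(w.lower() for w, t in tagged if t.startswith("VB"))
```

Here the untagged search finds three occurrences of "can", while the tag-restricted search isolates the single modal use.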

The architecture of most taggers includes functions for tokenization, ambiguity look-up, and disambiguation. Word boundaries must first be identified to divide the input text into units that allow analysis, a process also referred to as word segmentation. Taggers then assign possible POS solutions to input words by use of a lexicon, which is essentially a collection of word forms and their corresponding parts of speech. The same information may also be provided in the more economical form of generalized morphological rules. Tokens not included in the lexicon are then assigned possible POS solutions by a guesser, which proposes reasonable analyses by eliminating unlikely alternatives based on information about the lexicon; if the lexicon is known to include all pronouns and articles, for example, the guesser can eliminate these two word classes as possibilities. Finally, remaining ambiguities are resolved based on word information and contextual information encoded in the tagger. Word information includes knowledge such as the likelihood or frequency of a word being used as one word category rather than another. Contextual information refers to probabilities of POS sequences that enable deduction of the appropriate analysis (Voutilainen, 2003).
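
The three stages described above (tokenization, ambiguity look-up with a guesser, and disambiguation) can be sketched as follows. This is a deliberately simplified illustration: the lexicon, the tag set, and all probabilities are invented, and the greedy left-to-right choice stands in for the more elaborate disambiguation procedures used by real taggers.

```python
import re

# Toy lexicon: word form -> possible tags. All entries are invented.
LEXICON = {
    "the": {"DET"}, "a": {"DET"}, "she": {"PRON"},
    "can": {"MODAL", "NOUN", "VERB"}, "rust": {"NOUN", "VERB"},
    "opens": {"VERB"},
}
# Classes the guesser may propose; closed classes such as DET and PRON
# are assumed to be fully listed in LEXICON and are therefore excluded.
OPEN_CLASSES = {"NOUN", "VERB", "ADJ"}

# Word information: relative frequency of a word under each tag (invented).
WORD_FREQ = {("can", "MODAL"): 0.7, ("can", "NOUN"): 0.2, ("can", "VERB"): 0.1,
             ("rust", "NOUN"): 0.6, ("rust", "VERB"): 0.4}
# Contextual information: P(tag | previous tag) (invented).
TRANSITION = {("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.3, ("DET", "VERB"): 0.01,
              ("DET", "MODAL"): 0.01, ("NOUN", "MODAL"): 0.3,
              ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.1,
              ("MODAL", "VERB"): 0.8, ("MODAL", "NOUN"): 0.05,
              ("PRON", "VERB"): 0.5}

def tokenize(text):
    """Split on white space and separate punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def candidates(token):
    """Ambiguity look-up: lexicon first, guesser for unknown tokens."""
    if not token.isalnum():
        return {"PUNCT"}
    if token in LEXICON:
        return LEXICON[token]
    return OPEN_CLASSES  # guesser: closed classes already eliminated

def tag(text):
    """Greedy disambiguation combining word and contextual information."""
    prev, out = "START", []
    for tok in tokenize(text):
        cands = sorted(candidates(tok))

        def score(t):
            word_p = WORD_FREQ.get((tok, t), 1.0 / len(cands))
            ctx_p = TRANSITION.get((prev, t), 0.05)  # small default
            return word_p * ctx_p

        best = max(cands, key=score)
        out.append((tok, best))
        prev = best
    return out
```

With these invented figures, `tag("The can can rust.")` labels the first "can" as a noun and the second as a modal, and `tag("She opens a gizmo")` lets the guesser label the unknown token "gizmo" as a noun.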

A point worth noting in tokenization is that while word-level segmentation presents relatively few challenges for languages in which words are delimited by white space, the same process is significantly more complicated for Chinese and other languages in which tokens directly precede and succeed each other (Mikheev, 2003). In such languages, word boundaries must be identified by turning to statistical methods such as maximum sequence matching, n-gram methods, and other probabilistic models.
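
As a rough illustration of the matching approach, the sketch below implements greedy forward maximum matching against a word list, which is one common dictionary-based reading of the technique and is assumed here rather than taken from the sources cited. The dictionary and the example string are invented.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` by repeatedly taking the longest dictionary word
    that starts at the current position (single characters as fallback)."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy dictionary: "I", "like", "corpus material", "corpus", "study".
DICTIONARY = {"我", "喜歡", "語料", "語料庫", "研究"}

# "I like studying corpora", written without word delimiters.
segments = forward_max_match("我喜歡研究語料庫", DICTIONARY)
```

Because the longest match is preferred, the final three characters are kept together as 語料庫 ("corpus") rather than split into 語料 and 庫. Greedy matching of this kind can still choose a wrong boundary, which is one reason probabilistic models are used.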

Word segmentation, like other text processing applications such as text alignment, also requires sentence segmentation and is affected by the quality of its results. Sentence segmentation is usually performed in earlier text processing stages with regular expressions, introduced in Lu (2014) as special characters that can be used for specifying patterns. Sentence boundaries are most commonly identified by a sequence of sentence terminal, blank space, and capital letter. The error rate produced by such an algorithm can be reduced by supplying additional information, such as abbreviations that are never located at sentence endings, or words that always begin a new sentence when capitalized and preceded by a period (Mikheev, 2003).
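
A minimal sketch of this boundary rule, supplemented with an abbreviation list, is given below. The regular expression and the abbreviation list are assumptions chosen for illustration and are far from exhaustive.

```python
import re

# Abbreviations assumed never to end a sentence (illustrative, not exhaustive).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "vs."}

# Sentence terminal, then white space, then a capital letter (lookahead).
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split at terminal + space + capital, unless the word carrying the
    terminal is a listed abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)  # position just after the terminal
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a real boundary
        sentences.append(text[start:end])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

result = split_sentences(
    "Dr. Smith arrived late. He met Prof. Lee at 5 p.m. in Taipei. They left."
)
```

In this example the periods after "Dr" and "Prof" are skipped because of the abbreviation list, and the period after "p.m" is skipped because it is not followed by a capital letter, so three sentences are returned.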

2.3.2 Statistical Machine Translation

Corpora can be said to have laid the foundation for a new paradigm in machine translation that emerged in the 1990s; since then, researchers have explored corpus-based methodologies alongside the ongoing and more traditional linguistic rule-based approaches (Somers, 2003).

Machine translation (MT) was defined by the European Association for Machine Translation (EAMT) as “the application of computers to the task of translating texts from one natural language to another” (http://www.eamt.org/mt.php). Statistical machine translation (SMT), in particular, is an MT method which uses statistical means and a parallel corpus of previously translated texts to deduce the most probable translation for input texts (Mitkov, 2003; Somers, 2003).

The SMT approach differs significantly from traditional MT methods in that it is highly non-linguistic (Somers, 2003). Appropriate translations determined by an SMT system are based on two sets of statistical probabilities: firstly, the likelihood that a particular set of words in the source text will give rise to particular combinations of target text words; secondly, the likelihood that the generated words are arranged in correct sequences in the target language. These two sets of data manifest as a “translation model” and a “(target) language model,” respectively (p. 516), which are typically derived from a parallel corpus, most likely aligned at the sentence level, and a monolingual corpus of the target language(s).
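
The interplay of the two models is often presented in the following noisy-channel form. The notation is an assumption of this illustration and does not come from the sources cited above: f denotes the source sentence and e a candidate target sentence.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} \; P(f \mid e)\, P(e)
```

Here P(f | e) plays the role of the translation model and P(e) that of the target language model; the second equality follows from Bayes' rule, since P(f) is constant for a given input.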

Once provided with a source language text to translate, an SMT system divides the input text into units of word groups or phrases. The source text units are then compared against a parallel corpus, from which the translation model identifies a number of target language units that likely translate the source units. The possible equivalent units are then passed on to the language model, which determines the most probable word sequence in terms of linguistic validity in the target language, based on n-gram probabilities derived from the monolingual corpus. The SMT system then outputs the result with the highest probability of being both an accurate translation of the source text and a linguistically valid word sequence in the target language (Somers, 2003; Quah, 2006).
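
The procedure can be sketched with a toy example. All probabilities below are invented, the translation table and bigram table are hypothetical stand-ins for models estimated from corpora, and the exhaustive search over word orders is feasible only for very short inputs; it is used here purely for illustration.

```python
import itertools
import math

# Translation model: source unit -> {target unit: probability} (invented).
TRANSLATION = {
    "la": {"the": 0.9, "it": 0.1},
    "maison": {"house": 0.7, "home": 0.3},
    "bleue": {"blue": 1.0},
}
# Language model: bigram probabilities for the target language (invented).
BIGRAM = {
    ("<s>", "the"): 0.5, ("the", "blue"): 0.2, ("the", "house"): 0.3,
    ("blue", "house"): 0.3, ("blue", "home"): 0.05, ("house", "blue"): 0.01,
}

def lm_score(words):
    """Probability of a word sequence under the bigram model."""
    p = 1.0
    for prev, cur in zip(["<s>"] + words, words):
        p *= BIGRAM.get((prev, cur), 0.001)  # small default for unseen bigrams
    return p

def translate(source_units):
    """Pick the target sequence maximizing translation score x LM score."""
    best, best_p = None, 0.0
    options = [TRANSLATION[u].items() for u in source_units]
    for choice in itertools.product(*options):
        tm_p = math.prod(p for _, p in choice)
        for order in itertools.permutations(w for w, _ in choice):
            p = tm_p * lm_score(list(order))
            if p > best_p:
                best, best_p = list(order), p
    return best

output = translate(["la", "maison", "bleue"])
```

With these figures the translation model prefers "the", "house", and "blue", and the language model prefers the order "the blue house" over "the house blue", so the combined score selects the reordered sequence.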

Machine translation has also initiated much of the modern interest in parallel texts and in turn alignment (Gale & Church, 1991a), which is introduced below.

2.3.3 Sentence Alignment

An important process in compiling parallel corpora is alignment, which refers to the mapping and binding of corresponding source and target text units that translate each other. This process, often performed automatically by computer programs, can be carried out on text units at different levels, including paragraphs, sentences, phrases, and words. The technique is required in a wide variety of applications; in addition to compiling parallel corpora, it is used for compiling translation memories, dictionaries, and bilingual glossaries, and is also applied in cross-language information retrieval (Véronis, 2000; Bowker & Pearson, 2002; Quah, 2006).

Véronis (2000) pointed out that most alignment methods at sentence level are based on one or both of two major principles: lexical anchoring and sentence length correlation. Lexical anchoring methods make use of corresponding lexical elements, which are established as “anchor points” and a basis for identifying likely sentence alignments. These lexical anchors can be word pairs, either word-level alignments derived from texts to be aligned, or word translations obtained from an external bilingual dictionary; or, they may be “cognates,” which are graphically similar or identical elements such as names, dates, figures, symbols, special punctuation marks, or words with similar spelling in the source and target languages.

An early example of lexical anchoring was given in Kay and Röscheisen (1993); the study proposed a sentence alignment method supported by partial word-level alignments derived from word distributions in the texts on which sentence alignment was to be performed. The theoretical basis for this method arose from the observation that sentence pairs containing an aligned word pair will certainly be appropriate sentence alignments as well. Using an initial set of possible sentence alignments based on their location within the texts, a most likely set of aligned words is identified according to the tendency of the words to appear in corresponding sentences. The aligned word set is then used to calculate a new set of aligned sentence pairs. The resulting information contributes to a new estimate of possibly aligned words, and the induction process is repeated until no new sets of sentence alignments are found.

Sentence length correlation methods, on the other hand, were derived from the knowledge that the lengths of translated sentences tend to correlate highly with those of the source sentences from which they originated (Véronis, 2000). The statistical model proposed by Gale and Church (1991b) based its calculation of sentence length on the number of characters per sentence. From empirical data, the researchers determined the mean and variance of the ratio of target text characters per source character, in other words, the number of target text characters that each source text character gives rise to.

Sentence alignments were categorized into four types, and the probability of occurrence was calculated for each. The four types were alignments of one source text sentence to one target text sentence, one source or target text sentence with no corresponding counterpart, one source or target text sentence to two matching sentences, and two source text sentences to two target text sentences. This information and the lengths of the proposed sentence pair under consideration are combined to compute a probabilistic score, from which the maximum-likelihood sentence alignment is derived.
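
The following sketch illustrates how such a length-based score can be combined with dynamic programming to choose among the four alignment types. It is a simplified illustration under stated assumptions, not a reproduction of the published model: the prior probabilities, the mean ratio C, and the variance S2 are placeholder values, and the way the length difference is normalized is one common formulation.

```python
import math

# Placeholder priors for each alignment type (source count, target count).
PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
          (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}
C = 1.0    # assumed mean number of target characters per source character
S2 = 6.8   # assumed variance of that ratio

def length_cost(src_len, tgt_len):
    """Negative log-probability that spans of these character lengths
    are translations of each other."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / C) / 2
    delta = (tgt_len - src_len * C) / math.sqrt(mean * S2)
    # Two-tailed probability of a deviation at least this large.
    p = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(p, 1e-12))

def align(src_lens, tgt_lens):
    """Return the minimum-cost sequence of (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = length_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                total = cost[i][j] + step - math.log(prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    beads, i, j = [], n, m
    while (i, j) != (0, 0):  # trace the best path back to the start
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]

# Character lengths of three source and four target sentences (invented).
alignment = align([20, 35, 50], [22, 15, 21, 48])
```

In this invented example the second source sentence (35 characters) is paired with the second and third target sentences together (15 + 21 characters), giving a one-to-two alignment between two one-to-one alignments.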

A hybrid model making use of both lexical anchoring and sentence length correlation methods was proposed in Brown, Lai, and Mercer (1993). Working with records of Canadian Parliament proceedings, i.e., Hansard, the study used existing comments such as speakers or time as anchor points. After aligning subsections of the French and English records as divided by the anchors, a probabilistic model computed sentence alignments within subsections based on sentence length by word count.

Despite the differences across the sentence alignment methods that have been proposed, these alignment models generally operate on a number of common assumptions about the source and target texts to be aligned. It is often assumed that the source and target texts will largely correspond sentence by sentence, in approximately if not exactly the same order, with very few one-sentence-to-two, two-to-one, or two-to-two correspondences and very few omissions or additions (Véronis, 2000). However, as pointed out in Frankenberg-Garcia and Santos (2003), source text sentences are quite often split, combined, expanded with additional elements, or reordered during the translation process. Such alterations create considerable problems for automatic alignment programs. In fact, evaluations of the alignment model in Gale and Church (1991b) showed that the alignment program never achieved correct results for sentence pairs involving addition or deletion. The possibility of three or more sentences in either the source or target text of an aligned segment was not considered in the statistical model, yet such occurrences do exist. It is therefore quite likely that manual adjustment and correction of misaligned results would often be required to obtain more satisfactory sentence alignments.

2.3.4 Word and Phrasal Alignment

Accuracy in sentence alignment becomes an important issue when the results are used as a starting point for word-level alignment, in which case partially correct sentence alignment is no longer sufficient (Véronis, 2000). Processes of lexical alignment or extraction typically consist of two phases: detection of words or expressions in the source and target texts, followed by the mapping of those expressions onto each other.

To overcome the costs and language specificity constraints of linguistic approaches, researchers have continued to develop statistically based methodologies for lexical alignment. In the automatic translation approach they proposed, Brown et al. (1990) introduced statistical techniques to facilitate automatic glossary compilation, based on the belief that in a large corpus, the correct translation for a given source-language word will occur significantly more frequently than other candidates in the corresponding target language sentences. To account for differences in length between source- and target-text sentence pairs, the algorithms were further refined by accounting for source text words that produce “null words” (words for which a correspondence does not appear in the target text) or secondary words.
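
The intuition that the correct translation co-occurs most consistently with a source word can be illustrated with a small expectation-maximization procedure in the style of what is commonly taught as IBM Model 1. This textbook formulation is swapped in for illustration and is not claimed to be the exact procedure of Brown et al. (1990); the "<null>" source token, the function names, and the three-sentence corpus are all assumptions of this sketch.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t[(source_word, target_word)], an approximation of
    P(target_word | source_word), from sentence-aligned token lists."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            src = ["<null>"] + src  # lets target words align to nothing
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    frac = t[(sw, tw)] / norm  # expected alignment count
                    count[(sw, tw)] += frac
                    total[sw] += frac
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {k: count[k] / total[k[0]] for k in count})
    return t

def best_translation(t, source_word):
    """Most probable target word for a given source word."""
    cands = {tw: p for (sw, tw), p in t.items() if sw == source_word}
    return max(cands, key=cands.get)

# Toy sentence-aligned corpus (invented).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_model1(corpus)
```

Although "Haus" co-occurs equally often with "the" and "house" in this corpus, the iterations shift probability toward "house", because "the" is better explained by "das", which appears with it in two sentences.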

Addressing constraints that arise when source and target languages have different word orders, Wu (1995a; 1995b) introduced another automatic approach for identifying phrasal translation units. This method makes use of an inversion transduction grammar (ITG), a probabilistic formalism for bilingual language modeling and parsing. The input sentence pairs undergo syntactic analysis so that supposedly correct grammatical structures can be extracted. The ITG algorithms generate separate output streams for the source and target languages and match the corresponding constituents from the two streams, allowing constituents to be paired up in either left-to-right or inverted order. ITG, therefore, provides a language-independent and sequentially flexible approach to extracting several types of linguistic information from parallel corpora, including aligned phrasal or word units.

Several later studies have made use of or built on the foundation of ITG. One of those studies is Neubig, Watanabe, Sumita, Mori, and Kawahara (2011), in which an unsupervised probabilistic model for extracting phrasal alignments at multiple syntactic levels was proposed. Instead of building up from minimal phrase alignments, this ITG-based model generates phrase pairs at every branch of the syntax tree. The end result is a phrase table for SMT translation models that includes phrases at levels ranging from words to full sentences.

While the majority of bilingual concordance programs are based on sentence alignments, a word-based program is highly advantageous to the user, as it can identify the correspondences of the input word without requiring the user to supply the possible corresponding words in a second language (Gale & Church, 1991a). In Dagan, Church, and Gale (1993), which presented a word alignment method developed from a later model of Brown et al. (1990), the researchers also pointed out that word alignment programs can help translators save considerable time by providing them with the results of terminology questions already solved by other translators. In fact, a word-based bilingual concordance program has doubled or even more than tripled the speed with which translators produced bilingual terminology lexicons at the partner institution of this study. Even without comprehensive alignment results for all the input words, word or phrasal alignment can be helpful to translators (and lexicographers) in addressing issues of difficult terminology (Dagan et al., 1993).
