
Chapter 3 Method

3.2 Corpus Processing and Annotation

3.2.1 Text Processing

The retrieved and downloaded webpage files were processed in the Cygwin environment to remove unnecessary information and to convert the files from webpage to plain-text format. As this study focuses on statute language alone, notes, source credits, tables, and formulas in the U.S.C. were not included in the English corpora, while all tables and appendix file names were removed from the Chinese and translational corpora.

Non-statute information was identified and removed semi-automatically with the sed command in combination with (extended) regular expressions. Sed is a commonly used text-processing command with functions for searching, substituting, and deleting string patterns (Barnett, 2015). Regular expressions, basic and extended, use special characters to specify patterns (Lu, 2014).

Figure 3.2. Flowchart of corpus processing procedures and tools

Frequently used characters include positional anchors, which specify positions in a line; wildcards, such as the period, which matches any single character; quantifiers, which specify the number of repetitions; and character-class expressions, which match alphabetic characters, digits, or all characters except specified exclusions.
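The four kinds of special characters described above can be illustrated with a minimal sketch. The sample lines and the variable names are invented for illustration and are not taken from the corpora; grep with the -E option is used here only because it prints the matching lines, and the same patterns could be used as sed addresses.

```shell
# Illustrative sample lines (assumed, not from the corpora).
sample='101. Definitions
(a) In general
Sec. 12 applies to title 26
NOTES'

# Positional anchors: ^ and $ tie the pattern to the start and end of a line.
anchored=$(printf '%s\n' "$sample" | grep -E '^NOTES$')

# Wildcard and repetition: "." matches any single character,
# and {2} requires exactly two repetitions of the preceding item.
wildcard=$(printf '%s\n' "$sample" | grep -E 't.tle [0-9]{2}')

# Character classes: [[:alpha:]] matches one letter,
# and [^0-9] matches any character except a digit.
classes=$(printf '%s\n' "$sample" | grep -E '^\([[:alpha:]]\) [^0-9]+$')

printf '%s\n' "$anchored" "$wildcard" "$classes"
```

Each pattern selects exactly one of the four sample lines, which makes it easy to check a pattern on a small sample before applying it to whole files.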

Figure 3.3 shows a partial list of the sed commands drafted for processing the English corpora. The commands were compiled into a script file so that they could be run together and applied to multiple files by specifying the “-f” and “-i” options. When information indicating notes, source credits, tables, or formulas was found in the HTML tags, the appended “d” command instructed sed to delete the matching lines. For example, tags containing “class="note” indicate a line of notes, while tags containing “table” and “/table” indicate the first and last lines of a table. The following commands were therefore specified to remove lines of notes and the range of lines from the start to the end of each table:

/class="note/ d
/<table/,/\/table/ d
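As a minimal sketch of how the two deletion commands behave, they can be run on a small sample file. The HTML fragment, the file names, and the temporary directory below are assumptions made for illustration; the actual U.S.C. markup is more complex.

```shell
workdir=$(mktemp -d)

# Illustrative HTML fragment; the markup is a simplified assumption.
cat > "$workdir/sample.html" <<'EOF'
<p class="statutory-body">(a) The Secretary shall prescribe regulations.</p>
<p class="note-head">Editorial Notes</p>
<table class="uscode">
<tr><td>1954</td><td>1986</td></tr>
</table>
<p class="statutory-body">(b) This section takes effect on enactment.</p>
EOF

# The two deletion commands, collected in a script file.
cat > "$workdir/clean.sed" <<'EOF'
/class="note/ d
/<table/,/\/table/ d
EOF

# -f reads the commands from the script file. Adding -i would edit the
# input files in place instead of printing the result.
cleaned=$(sed -f "$workdir/clean.sed" "$workdir/sample.html")
printf '%s\n' "$cleaned"
```

Only the two statutory-body lines remain: the first command deletes the single note line, and the second deletes every line from the opening to the closing table tag.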

To ensure intelligibility after processing, HTML decimal codes for special characters or symbols were then replaced with English letters, punctuation, and numbers.

Figure 3.3. Excerpt of sed commands for processing the English corpora

For example, an apostrophe is represented by the string “&#39;” in HTML code. To display apostrophes as punctuation in the corpora, the substitute command of sed was used to change occurrences of the string “&#39;” into an apostrophe, and the global flag “g” was appended so that every occurrence on a line is replaced, rather than only the first:

s/&#39;/'/g
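A minimal sketch of this kind of substitution is given below. It assumes that the decimal codes take the standard HTML form (&#39; for the apostrophe, &#34; for the double quotation mark, &#38; for the ampersand); the sample sentence and the choice of codes are illustrative and do not reproduce the full list used in the study.

```shell
# Illustrative input; the selection of codes is an assumption.
raw='The Secretary&#39;s &#34;findings&#34; under sections 101 &#38; 102'

# In the replacement part of s///, a bare & stands for the matched text,
# so a literal ampersand has to be written as \&.
decoded=$(printf '%s\n' "$raw" | sed \
  -e "s/&#39;/'/g" \
  -e 's/&#34;/"/g' \
  -e 's/&#38;/\&/g')

printf '%s\n' "$decoded"
```

The ampersand needs no escaping in the search pattern, but it must be escaped in the replacement, which is a common source of errors when decoding “&#38;”.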

To facilitate the subsequent part-of-speech tagging, lines were also combined when a single sentence spanned more than one line, and paragraphs containing more than one sentence were divided into multiple lines wherever possible.

A final text-processing task was performed after tagging and sentence alignment but before the extraction of phrasal alignments: all three sets of corpora were processed to remove the list item markers at the beginning of lines, and in-text numerals were replaced with the hash symbol (#). These items were edited because, while headings, listings, and numerals are useful in the sentence alignment process, they do not contribute to the primary object of subsequent analyses.

List item markers, which can take the form of digits, Roman numerals, or alphabetical letters, often confuse the POS tagger and therefore lead to mismatches in later analyses by computerized tools. For example, the tagger may fail to annotate all items on the same numbered list consistently, so that some markers are tagged as cardinal numbers (CD) while others are tagged as list item markers (LS). The list item marker (a) is sometimes mistaken for a determiner (DT) or noun (NN). Problems in distinguishing list item markers from other words go on to influence word frequencies, keyword analysis, n-gram/cluster frequencies, and concordance matches.

In-text markers and numerals were replaced with the hash symbol (#) because this study is more interested in the general patterns associated with list item markers or numerals than in the actual marker or numeral that occurs. With all numerals represented by the same symbol, patterns are also more likely to surface.
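The two edits described above can be sketched as follows. The patterns are assumptions made for illustration rather than the commands actually used in the study: the sketch covers only parenthesized line-initial markers of up to four letters or any number of digits, and it treats a number containing internal commas or periods as a single numeral.

```shell
# Illustrative sample lines (assumed, not from the corpora).
sample='(a) The term "employer" means a person with 15 or more employees.
(1) within 30 days after notice, pay 1,000 dollars; and
(ii) not later than 2015.
Each State shall report annually.'

# Markers are removed first; otherwise "(1)" would become "(#)"
# before the marker pattern had a chance to match it.
normalized=$(printf '%s\n' "$sample" | sed -E \
  -e 's/^\(([0-9]+|[A-Za-z]{1,4})\)[[:space:]]*//' \
  -e 's/[0-9]+([,.][0-9]+)*/#/g')

printf '%s\n' "$normalized"
```

Because the marker pattern is anchored to the start of the line, parenthesized references inside a sentence are left in place and would need a separate rule.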

After text processing, the corpora total approximately 2.2 million Chinese characters in the Chinese corpora, 1.9 million English words in the corresponding translational corpora, and approximately 20 million English words in the non-translational English corpora.

3.2.2 Part-of-Speech Tagging

To prepare the Chinese texts for phrasal alignment, the Chinese statutes in this study were processed by Jseg, an automatic Chinese segmentator modified from Jieba (Sun, as modified by Liu, 2014). Jseg defines “word” boundaries and annotates the texts with POS tags. The program was trained with corpora from the Academia Sinica Balanced Corpus; algorithms of the Brill Tagger were incorporated to provide a POS-tagging feature trained on corpora from the Sinica Treebank.

For this study, the segmentator was accessed through the web interface of PTT Corpus (http://lopen.linguistics.ntu.edu.tw/PTT/jseg/), a dynamic corpus designed to automatically collect, update, and process data from the bulletin board system PTT (screenshot of PTT Corpus interface shown in Figure 3.4 on p. 48).

POS tagging of the English texts in this study, including the English and translational corpora, was performed by the Stanford Part-Of-Speech (POS) Tagger 3.5.1 (Toutanova, Klein, Manning, & Singer, 2003). According to assessments by the developers, the tagger achieves per-position tag accuracy of up to 97.24% with a model pre-trained on the Penn Treebank Wall Street Journal (WSJ) Corpus.

The tagset employed for denoting POS category is the Penn Treebank tagset (Santorini, 1990), originally designed for the large annotated corpus Penn Treebank of 4.5 million words in U.S. English (Marcus, Santorini, & Marcinkiewicz, 1993).

Developed on the basis of the Brown Corpus (Francis & Kucera, 1964) tagset, the Penn tagset employs a reduced number of tags by eliminating redundancy, eliminating inconsistencies, encoding by syntactic functions, and avoiding indeterminacy (allowing for multiple tags). Instead of the original 87 in Brown, the Penn tagset comprises 36 POS tags and 12 tags for punctuation and currency symbols (Marcus et al., 1993). A list of the POS tags is shown in the Appendix.

To process the large quantity of text in this study, the English tagger was called in the Cygwin environment and set to treat each line as a sentence with the option “-sentenceDelimiter newline,” since statutes contain many headings and listed items that are not always marked with line- or sentence-ending punctuation.
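A call of this kind might look like the following sketch. The jar location, model file, input file name, and memory setting are hypothetical and depend on the local installation; only the “-sentenceDelimiter newline” option is taken from the description above. The command is printed instead of executed when the tagger or the input file is not present.

```shell
# Hypothetical locations; adjust to the local installation.
TAGGER_JAR="stanford-postagger.jar"
MODEL="models/english-left3words-distsim.tagger"
INPUT="usc_sample.txt"

# -sentenceDelimiter newline makes the tagger treat each input line as one sentence.
CMD="java -Xmx1g -cp $TAGGER_JAR edu.stanford.nlp.tagger.maxent.MaxentTagger -model $MODEL -textFile $INPUT -sentenceDelimiter newline"

if [ -f "$TAGGER_JAR" ] && [ -f "$INPUT" ]; then
  # The paths above contain no spaces, so word splitting of $CMD is safe here.
  $CMD > "${INPUT%.txt}.tagged.txt"
else
  # Dry run: show the command when the tagger or the input file is absent.
  printf '%s\n' "$CMD"
fi
```

With one sentence per line in the input, the tagged output keeps the same line structure, which simplifies the later sentence alignment.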
