Related Work - A Study of Chinese-English Proper Name Extraction Using Web Mining

In this chapter, we briefly review prior research on automatic term translation.

The methods are classified into three categories according to the corpus they use:

1. Parallel/comparable corpus-based method [Nie et al. 1999; Shao et al. 2004; Lee et al. 2005],

2. Bilingual dictionary-based method [Gao et al. 2001; Seo et al. 2005],

3. Web-based method [Lu et al. 2001; Lu et al. 2004; Zhang et al. 2004; Huang et al. 2005; Wu et al. 2005; Fang et al. 2005; Wang et al. 2006].

2.1 Parallel/Comparable Corpus-Based Method

A parallel corpus is a collection of sentence pairs that have the same meaning but are written in different languages. Nie et al. [1999] proposed a method to automatically gather parallel texts from the Web based on anchor texts, hypertexts, webpage names, and HTML structure. They then used a probabilistic model to extract translations from the gathered parallel texts. The core of the model is p(t|s), the probability that a word t appears in the translation of a sentence containing a word s. However, for language pairs other than English-French, the pair studied in their work, the amount of parallel documents on the Web might not always be sufficient. Lee et al. [2005] proposed a model for extracting proper names and their corresponding translations from a parallel corpus. They proposed a statistical transliteration model P(C|E) to calculate the probability between an English proper name and the Romanized transliteration of a Chinese term. The parameters of the model are learned automatically from a bilingual proper name list using the EM algorithm. Experimental results show that the average rates of word and character precision are 93.8% and 97.8%, respectively.
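
To make the quantity p(t|s) concrete, the following is a minimal sketch that estimates it by relative co-occurrence frequency over sentence pairs. This simple counting estimator, the function name, and the toy English-French pairs are our own assumptions for illustration; it is not necessarily how the model of Nie et al. was trained.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(sentence_pairs):
    """Estimate p(t|s) as the fraction of sentence pairs containing source
    word s whose translation contains target word t (an assumed estimator)."""
    source_counts = Counter()
    pair_counts = defaultdict(Counter)
    for source_sentence, target_sentence in sentence_pairs:
        source_words = set(source_sentence.split())
        target_words = set(target_sentence.split())
        for s in source_words:
            source_counts[s] += 1
            for t in target_words:
                pair_counts[s][t] += 1
    return {
        s: {t: count / source_counts[s] for t, count in targets.items()}
        for s, targets in pair_counts.items()
    }


# Toy parallel corpus (hypothetical data).
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
probs = estimate_translation_probs(pairs)
print(probs["house"])
```

In this toy corpus, "maison" appears in the translation of every sentence containing "house", so it receives the highest estimate for that source word.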

A comparable corpus consists of a first-language corpus and a second-language corpus from the same domain. Shao et al. [2004] proposed a method to mine new word translations from comparable corpora by combining context and transliteration information. They used a language modeling approach, P(Q|D), to extract translations on the basis of context information. They experimented on six months of the Chinese and English Gigaword corpora and obtained about 78% precision and about 32% recall.

2.2 Bilingual Dictionary-Based Method

The dictionary-based method is a widely used approach to term translation because of its simplicity and the increasing availability of machine-readable dictionaries. In this method, the major task is word sense disambiguation, because one query term may have multiple translation equivalents in the bilingual dictionary. Gao et al. [2001] used statistical models to overcome this problem. First, they recognized and translated noun phrases using statistical models and phrase translation patterns. Second, they selected the best translations based on the cohesion between translation words. Cohesion is a term similarity measured by EMMI, proposed by Van Rijsbergen [1979].

However, it is difficult to obtain a sufficient amount of word- or phrase-aligned parallel corpora, which makes extracting phrase translation patterns difficult. Seo et al. [2005] proposed a new translation selection model. They first generated all possible candidate translation queries and then calculated similarity scores among the terms in each candidate query. This method attempts to find the target query whose translation equivalents are strongly related to each other. However, proper nouns are often not included in bilingual dictionaries, so it is difficult to handle their translation using dictionaries alone.
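
The candidate-generation and selection idea described above can be sketched as follows. The function name, the toy dictionary, and the similarity table are hypothetical, and summing pairwise similarities is our assumption; the scoring function actually used by Seo et al. is not reproduced here.

```python
from itertools import combinations, product


def select_translation(query_terms, dictionary, similarity):
    """Enumerate every candidate target query and keep the one whose terms
    are most strongly related (sum of pairwise similarities, an assumption)."""
    candidate_lists = [dictionary[term] for term in query_terms]
    best_query, best_score = None, float("-inf")
    for candidate in product(*candidate_lists):
        score = sum(similarity(a, b) for a, b in combinations(candidate, 2))
        if score > best_score:
            best_query, best_score = candidate, score
    return best_query, best_score


# Hypothetical bilingual dictionary: each source term has two equivalents.
dictionary = {
    "source_term_1": ["bank", "shore"],
    "source_term_2": ["rate", "speed"],
}
# Hypothetical term-similarity table, e.g. derived from co-occurrence.
table = {
    frozenset(("bank", "rate")): 0.9,
    frozenset(("bank", "speed")): 0.2,
    frozenset(("shore", "rate")): 0.1,
    frozenset(("shore", "speed")): 0.1,
}


def toy_similarity(a, b):
    return table.get(frozenset((a, b)), 0.0)


best_query, best_score = select_translation(
    ["source_term_1", "source_term_2"], dictionary, toy_similarity
)
print(best_query, best_score)
```

Exhaustive enumeration grows exponentially with the number of query terms, so this sketch is only practical for short queries.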

2.3 Web-Based Method

Research based on Web resources focuses on two kinds of data: anchor texts and search-result pages. Lu et al. [2001a, 2001b] extracted translation pairs from anchor texts pointing to the same webpage. They first collected the anchor-text set of each Web page. For a query term, a term_a is taken as its translation if term_a is written in the target language and frequently co-occurs with the query term in the same anchor-text sets. They employed a probabilistic inference model to extract the translation of a query term. They experimented on 622 English query terms and obtained about 57% accuracy. However, not every language pair has sufficient anchor texts for effective extraction of translations for Web queries. To deal with this problem, Lu et al. [2004] proposed a transitive translation model, in which the translations of a query term can be extracted via its translation in an intermediate language. They further exploited the Competitive Linking Algorithm to reduce interference from translation errors. The experiments showed that the approach is particularly useful when the language pair under consideration lacks sufficient anchor texts.
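
The anchor-text co-occurrence idea can be illustrated with a minimal sketch. It uses raw co-occurrence counts instead of the probabilistic inference model of Lu et al., and the function names, the CJK range check, and the toy anchor-text sets are our own assumptions.

```python
from collections import Counter


def is_chinese(text):
    """Rough target-language check: contains a CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def rank_translations(source_term, anchor_text_sets):
    """Count Chinese anchor texts that share an anchor-text set with the
    source term; higher counts suggest more likely translations."""
    counts = Counter()
    for anchor_texts in anchor_text_sets:
        if source_term in anchor_texts:
            counts.update(t for t in anchor_texts if is_chinese(t))
    return counts.most_common()


# Each set holds the anchor texts pointing to one Web page (hypothetical data).
anchor_text_sets = [
    {"Yahoo", "雅虎", "Yahoo! Taiwan"},
    {"Yahoo", "雅虎", "搜尋引擎"},
    {"Google", "谷歌"},
]
ranked = rank_translations("Yahoo", anchor_text_sets)
print(ranked)
```

Raw counts favor generic terms that appear in many anchor-text sets, which is one reason a normalized or probabilistic score is preferable in practice.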

Many studies focus on search-result pages. Zhang et al. [2004] extracted translations of query terms from search-result pages. First, they detected potential Chinese out-of-vocabulary terms based on a Hidden Markov Model and term co-occurrence. Second, they submitted the Chinese out-of-vocabulary terms to a search engine and retrieved the top 100 Chinese snippets. Third, they extracted translation candidates that occurred immediately preceding or succeeding the Chinese out-of-vocabulary term. Finally, they ranked the translation candidates by their lengths and frequencies. Wang et al. [2006] proposed a Web-based approach to translating unknown query terms for cross-language information retrieval in digital libraries. Their new association measure, called SCPCD, combines the symmetric conditional probability [Silva et al. 1999] with the concept of context dependency [Chien 1997] of an n-gram. They used this measure to extract translation candidates based on the frequencies of a candidate's substrings and the number of its unique left and right adjacent words or characters. Finally, they linearly combined the Chi-Square Test [Gale et al. 1991] and Context Vector Analysis to rank the translation candidates. The experiments showed that the approach can effectively translate unknown terms.
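
The adjacency-based extraction step attributed to Zhang et al. above can be sketched as follows. The regular expressions, the function name, the toy snippets, and the ranking rule (frequency first, then length) are our assumptions; the source does not specify how length and frequency are combined.

```python
import re
from collections import Counter

# A run of English words, allowing spaces, periods, apostrophes, and hyphens.
ENGLISH_RUN = r"[A-Za-z][A-Za-z .'-]*[A-Za-z]|[A-Za-z]"


def extract_adjacent_candidates(term, snippets):
    """Collect English strings immediately before or after the Chinese term,
    then rank them by frequency and, for ties, by length (assumed rule)."""
    before = re.compile(r"(" + ENGLISH_RUN + r")[\s(（]*" + re.escape(term))
    after = re.compile(re.escape(term) + r"[\s(（]*(" + ENGLISH_RUN + r")")
    counts = Counter()
    for snippet in snippets:
        for pattern in (before, after):
            for match in pattern.finditer(snippet):
                counts[match.group(1).strip()] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], -len(item[0])))


# Hypothetical snippets returned for the out-of-vocabulary term.
snippets = [
    "Pablo Picasso 畢卡索 是西班牙畫家",
    "畢卡索(Picasso)的作品",
    "畢卡索 Picasso 展覽",
]
ranked = extract_adjacent_candidates("畢卡索", snippets)
print(ranked)
```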

A number of techniques have been proposed to improve translation performance. Huang et al. [2005] used a query expansion phase to obtain more related snippets and a combination of transliteration, translation, and frequency-distance models to rank translation candidates. First, they extracted expansion candidates from the snippets returned for the source query terms. They then used a prepared dictionary to translate the expansion candidates and rules to filter out irrelevant terms. Finally, they took the most frequent terms as expansion terms. In their experiments, they achieved 80% accuracy with 165 snippets. Fang et al. [2005] used character-based string frequency estimation to gather translation candidates. They defined two kinds of candidate noise: subset redundancy and prefix/suffix redundancy. Subset redundancy occurs when term_a is a subset of another term_b but the rank of term_a is lower than that of term_b. Prefix/suffix redundancy occurs when term_a is a prefix or suffix of term_b but the rank of term_a is greater than that of term_b. They proposed sort-based subset deletion and mutual information methods to deal with these two kinds of noise, respectively. After the candidate noise is removed, the remaining candidates can be ranked with better results. They experimented on 401 English terms and obtained about 72% accuracy.
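
As a rough illustration of subset redundancy only, the sketch below drops a candidate when it is a substring of a candidate that appears earlier in the ranked list. Treating "subset" as substring and "lower rank" as a later list position are our interpretations, and the function name and toy data are hypothetical; this is not the sort-based subset deletion procedure of Fang et al.

```python
def remove_subset_redundancy(ranked_candidates):
    """Keep candidates in rank order, skipping any candidate that is a proper
    substring of an already kept, better-ranked candidate."""
    kept = []
    for candidate in ranked_candidates:
        if any(candidate != better and candidate in better for better in kept):
            continue
        kept.append(candidate)
    return kept


# Candidates for one English term, best first (hypothetical ranking).
ranked_candidates = ["畢卡索", "卡索", "畢卡", "西班牙"]
cleaned = remove_subset_redundancy(ranked_candidates)
print(cleaned)
```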

Additionally, Wu et al. [2005] proposed the TermMine system, which uses surface patterns learned from a list of bilingual terms to extract translation candidates more precisely. A surface pattern describes the format in which a source query and its translation co-occur. For example, submitting “Picasso” and “畢卡索” to a search engine returns texts such as the following:

“…Picasso (畢卡索)…” and “…畢卡索Pablo Picasso…”.

From these texts, the surface patterns “E(C” and “CwE” can be extracted, in which E is a source English word from the bilingual list, C is the translation of E, w is any other English word, and the remaining symbols are punctuation.

They first submitted bilingual pairs to a search engine and extracted surface patterns from the search-result pages. Translation candidates were then extracted if they matched the surface patterns. Finally, the translation candidates were ranked by frequency or by the probability calculated with a transliteration model. They experimented on 300 English terms and obtained 86% accuracy.
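
The pattern-learning step can be sketched for the Picasso example as follows. The tokenization rules and the function name are our assumptions, chosen so that the two example texts yield the patterns “E(C” and “CwE”; the actual TermMine implementation may differ.

```python
import re


def learn_surface_pattern(english, chinese, text):
    """Abstract the span covering a known bilingual pair into a pattern:
    E = the English term, C = its translation, w = any other English word;
    whitespace is dropped and other characters are kept as they are."""
    e_pos, c_pos = text.find(english), text.find(chinese)
    if e_pos < 0 or c_pos < 0:
        return None
    start = min(e_pos, c_pos)
    end = max(e_pos + len(english), c_pos + len(chinese))
    token_re = re.compile(
        "(?P<E>" + re.escape(english) + ")"
        "|(?P<C>" + re.escape(chinese) + ")"
        "|(?P<w>[A-Za-z]+)"
        "|(?P<space>\\s+)"
        "|(?P<other>.)",
        re.DOTALL,
    )
    symbols = []
    for match in token_re.finditer(text[start:end]):
        kind = match.lastgroup
        if kind in ("E", "C", "w"):
            symbols.append(kind)
        elif kind == "other":
            symbols.append(match.group())
    return "".join(symbols)


pattern_1 = learn_surface_pattern("Picasso", "畢卡索", "…Picasso (畢卡索)…")
pattern_2 = learn_surface_pattern("Picasso", "畢卡索", "…畢卡索Pablo Picasso…")
print(pattern_1, pattern_2)
```

Patterns learned this way from many bilingual pairs could then be counted, and the most frequent ones applied to snippets of a new query term to locate candidates in the C position.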
