Query Expansion - Web-based Term Translation Combining Naming Rules

Chapter 3 Web-based Term Translation Combining Naming Rules

3.2 Query Expansion

After a named entity is labeled, we retrieve Chinese search results and apply a query expansion strategy chosen according to its category label. Search results retrieved both with and without query expansion are then used to extract translation candidates.

Query expansion is intended to retrieve more results that contain the correct translation. Because frequency is an important factor in our translation extraction method, it helps if a high proportion of the retrieved results contain the translation. The best form of query expansion is to add the translation to the named entity [Yang et al., 2009]. Since we do not know the translation of the named entity, the alternative is to add a partial translation. For example, Figure 3 and Figure 4 compare the search results for “Clouds of Witness” without and with query expansion. They show that query expansion helps retrieve more results that contain the translation. We describe our query expansion approach in detail in the following sections.

Figure 3. Top five results for “Clouds of Witness” without query expansion

Figure 4. Top five results for “Clouds of Witness” with query expansion

3.2.1 Query Expansion of Book and Movie Titles

We use the retrieved Chinese search results to generate query expansions. We observe that some words in a title are translated literally. For example, the word “Clouds” in “Clouds of Witness” is translated directly to “雲” (證言疑雲), and the word “tomorrow” in “The Day After Tomorrow” is translated directly to “明天” (明天過後).

We use this observation to find more useful search results.

Because a translation usually co-occurs with its named entity, we extract the Chinese patterns near the named entity and look for a part of the translation among them by the following steps:

1. We translate the title word by word by looking up BOW³. For example, “Clouds of Witness” is translated to {雲, 幻覺, 嫌疑, 目擊者, 證人}.

2. We extract Chinese patterns near the title. For example, if the snippet is “桃樂絲．榭爾絲, Clouds of Witness 證言疑雲”, then 證言疑雲 and 榭爾絲 are extracted, because they are the Chinese patterns nearest to the title “Clouds of Witness”.

³ http://bow.sinica.edu.tw/

3. We sort the extracted Chinese patterns by frequency.

4. We select the pattern that contains a translation and has the highest frequency. For example, given the patterns {介紹: 10, 中文: 8, 言疑雲: 5, 疑雲: 3}, “言疑雲” is selected, because it contains “雲” and has the highest frequency among the patterns that contain a translation.
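The four steps above can be sketched as follows. This is a minimal illustration and not the implementation used in this work: the function names are hypothetical, the dictionary lookup of step 1 is replaced by a hard-coded set, and "nearest" is assumed to mean the closest run of Chinese characters on each side of the title.

```python
import re

# Runs of CJK unified ideographs; punctuation such as "．" ends a run.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def nearest_chinese_patterns(snippet, title):
    """Step 2 (assumed reading): closest Chinese run before and after the title."""
    match = re.search(re.escape(title), snippet, re.IGNORECASE)
    if match is None:
        return []
    before = CJK_RUN.findall(snippet[:match.start()])
    after = CJK_RUN.findall(snippet[match.end():])
    found = []
    if before:
        found.append(before[-1])
    if after:
        found.append(after[0])
    return found

def select_partial_translation(pattern_freq, word_translations):
    """Steps 3 and 4: most frequent pattern that contains a word translation."""
    ranked = sorted(pattern_freq.items(), key=lambda kv: kv[1], reverse=True)
    for pattern, _freq in ranked:
        if any(t in pattern for t in word_translations):
            return pattern
    return None

# Examples taken from the text above.
near = nearest_chinese_patterns("桃樂絲．榭爾絲, Clouds of Witness 證言疑雲",
                                "Clouds of Witness")
word_translations = {"雲", "幻覺", "嫌疑", "目擊者", "證人"}  # stand-in for BOW lookup
pattern_freq = {"介紹": 10, "中文": 8, "言疑雲": 5, "疑雲": 3}
selected = select_partial_translation(pattern_freq, word_translations)
print(near, selected)
```

With the example data from the text, the sketch extracts 榭爾絲 and 證言疑雲 from the snippet and selects 言疑雲 as the partial translation.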

However, an inappropriate query expansion term causes query drift. To avoid it, we add clue words to the query. A clue word is a Chinese pattern that often co-occurs with both the named entity and its translation. To obtain clue words for each category, we submit each training instance together with its translation to Google. We then extract 2-gram to 4-gram Chinese patterns from the returned results and keep the patterns whose average frequency is larger than 0.5 per named entity in the category. We select an appropriate clue word by the following equation:

$$c^{*} = \arg\max_{c} \; \frac{f_c}{f_{max}} \times prob(c \mid C_a) \qquad (4)$$

where $C_a$ is a category, $c$ is a clue word, $f_c$ is the frequency of $c$ in the search result, and $f_{max}$ is the frequency of the most frequent clue word in the search result. After the partial translation and the clue word are extracted, we submit them together with the title to obtain Chinese search results. In the above case, we submit “言疑雲” and “小說” with “clouds of witness” to Google and retrieve the search results.
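The clue word procedure can be sketched as below. Several points are assumptions made for illustration only: the function names are hypothetical, the score is read as the frequency of a clue word divided by the frequency of the most frequent clue word, multiplied by prob(c|Ca), the way prob(c|Ca) is estimated is not given here and is therefore passed in as a plain dictionary, and the numbers in the usage example are invented.

```python
import re
from collections import Counter

CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def chinese_ngrams(text, n_min=2, n_max=4):
    """Yield every 2-gram to 4-gram inside each run of Chinese characters."""
    for run in CJK_RUN.findall(text):
        for n in range(n_min, n_max + 1):
            for i in range(len(run) - n + 1):
                yield run[i:i + n]

def clue_word_candidates(results_per_entity, threshold=0.5):
    """Keep n-grams whose average frequency per training entity exceeds threshold.

    results_per_entity holds one list of result snippets per training entity.
    """
    totals = Counter()
    for snippets in results_per_entity:
        for snippet in snippets:
            totals.update(chinese_ngrams(snippet))
    n_entities = len(results_per_entity)
    return {g: c / n_entities for g, c in totals.items() if c / n_entities > threshold}

def select_clue_word(result_freq, prob_given_category):
    """Pick the clue word maximizing (f_c / f_max) * prob(c | Ca) (assumed reading)."""
    f_max = max(result_freq.values())
    return max(result_freq,
               key=lambda c: result_freq[c] / f_max * prob_given_category.get(c, 0.0))

# Invented toy data: two training entities, one snippet each.
candidates = clue_word_candidates([["推理小說"], ["小說"]])
# Invented frequencies and probabilities for a hypothetical "book" category.
clue = select_clue_word({"小說": 6, "作者": 8}, {"小說": 0.4, "作者": 0.2})
print(candidates, clue)
```

In the toy data, only 小說 appears more than 0.5 times per entity on average, and it also wins the scoring step because 6/8 × 0.4 exceeds 8/8 × 0.2.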

3.2.2 Query Expansion of Medicine Names

The query expansion method for book and movie titles is not suitable for medicine names. We observe that medicine names are usually transliterated, and that the Chinese characters corresponding to English syllables are fixed. Based on these observations, we propose a naming-rule-based method to generate a partial translation. We collect pairs of English syllables and Chinese characters from the translation pairs of training instances by the following steps:

1. We split the medicine name into English syllables, e.g. “Setazindol” → “se”, “ta”, “zin”, “do”, “l”.

2. Every syllable is mapped to every Chinese character in the corresponding translation, which generates m × n pairs, where m is the number of syllables and n is the number of Chinese characters. E.g. the translation of “Setazindol” is “司他秦多”, and together they generate pairs such as (se, 司), (se, 他), (se, 秦), (se, 多), and so on.
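The pair generation in step 2 is a Cartesian product, which can be sketched as follows. The function name is hypothetical, and the syllable splitting of step 1 is assumed to be done already, so the syllables are passed in as a list.

```python
from itertools import product

def syllable_char_pairs(syllables, translation):
    """Pair every syllable with every character of the Chinese translation."""
    return list(product(syllables, translation))

# Example from the text: 5 syllables and 4 characters give 5 x 4 = 20 pairs.
pairs = syllable_char_pairs(["se", "ta", "zin", "do", "l"], "司他秦多")
print(len(pairs), pairs[:4])
```

For “Setazindol” this yields 20 pairs, starting with the four pairs for “se” listed above.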

We expand medicine names by the following steps:

1. We split the name into syllables and, for each syllable, retrieve the corresponding Chinese character with the highest confidence. E.g. the pairs of “Setazindol” with the highest confidence are (“se”, “酶”, 0.192), (“ta”, “他”, 0.405), and (“do”, “多”, 0.325).

2. We select the characters of the two consecutive syllables with the highest confidence sum. E.g. the pairs of consecutive syllables of “Setazindol” are “seta”, “tazin”, “zindo”, and “dol”. “seta” has the highest sum, 0.597, so we select “酶他” as the query expansion.

3. If no corresponding character exists, we select a clue word by equation (4).
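A minimal sketch of steps 1 to 3 is given below. The function name and the dictionary layout are hypothetical, and syllables without a known character are assumed to contribute nothing to the sum. Returning `None` stands for the fallback to a clue word in step 3.

```python
def expand_medicine(syllables, best_char):
    """Return (characters, confidence sum) of the best pair of consecutive syllables.

    best_char maps a syllable to its (character, confidence) with highest confidence.
    Returns (None, 0.0) when no syllable has a corresponding character.
    """
    best_text, best_sum = None, 0.0
    for first, second in zip(syllables, syllables[1:]):
        chars, total = "", 0.0
        for syllable in (first, second):
            if syllable in best_char:
                char, confidence = best_char[syllable]
                chars += char
                total += confidence
        if chars and total > best_sum:
            best_text, best_sum = chars, total
    return best_text, best_sum

# Confidence values from the "Setazindol" example in the text.
best_char = {"se": ("酶", 0.192), "ta": ("他", 0.405), "do": ("多", 0.325)}
expansion, score = expand_medicine(["se", "ta", "zin", "do", "l"], best_char)
print(expansion, round(score, 3))
```

With the values from the text, “seta” scores 0.192 + 0.405 = 0.597, ahead of “tazin” (0.405), “zindo” (0.325), and “dol” (0.325), so 酶他 is returned.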

3.2.3 Query Expansion of Company Names

Expanding company names is similar to expanding medicine names. We find that the proper nouns and location names in a company name are transliterated, while the other words are usually translated. Furthermore, we observe that some words are always translated to the same Chinese words. For example, “food” is usually translated to “食品”, and “trade” is usually translated to “貿易”. Therefore, we can generate an appropriate partial translation from pairs of English tokens and Chinese tokens according to the naming rules. An example of such a pair is (“co ltd”, “有限公司”). The pairs are obtained from the company names in the translation pairs of training instances by the following steps:

1. The English company name is split into one-grams and two-grams.

2. The corresponding translation is split into two-grams to four-grams.

3. Every English gram is mapped to every Chinese gram as a pair, which generates m × n pairs, where m is the number of English grams and n is the number of Chinese grams. E.g. “JIUJIANG HUIYUAN FOOD STUFF CO., LTD” and “九江匯源食品飲料有限公司” generate 330 pairs.

4. We compute the confidence of each pair by equation (5) and retain the pairs whose confidence is larger than 0.01.
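Steps 1 to 3 can be sketched as below, mainly to check the pair count of the example. The function names are hypothetical, and the tokenization (lowercasing and keeping only alphabetic runs, so that “CO.,” becomes “co”) is an assumption. The confidence computation of step 4 is left out.

```python
import re
from itertools import product

def english_grams(name):
    """One-grams and two-grams over lowercase alphabetic tokens (assumed tokenization)."""
    tokens = re.findall(r"[a-z]+", name.lower())
    bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return tokens + bigrams

def chinese_grams(text, n_min=2, n_max=4):
    """Character two-grams to four-grams of the Chinese translation."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

en = english_grams("JIUJIANG HUIYUAN FOOD STUFF CO., LTD")
zh = chinese_grams("九江匯源食品飲料有限公司")
pairs = list(product(en, zh))
print(len(en), len(zh), len(pairs))
```

The English name gives 6 one-grams and 5 two-grams (11 grams), and the 12-character translation gives 11 + 10 + 9 = 30 grams, so 11 × 30 = 330 pairs, which matches the count stated in step 3.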

The query expansion method for company names consists of the following steps:

1. We split the English company name into one-grams and two-grams and, for each gram, retrieve the corresponding pair with the highest confidence. E.g. the pairs of “JIUJIANG HUIYUAN FOOD STUFF CO., LTD” are {jiujiang: 九江, food: 食品, ltd: 有限, co ltd: 有限公司}.

2. We select the Chinese pattern in these pairs that has the highest frequency in the search result. E.g. if the frequencies of the Chinese patterns are {九江: 15, 食品: 5, 有限: 10, 有限公司: 8}, we choose “九江” as the query expansion.
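The selection in step 2 can be sketched as follows. The function names are hypothetical, and counting a pattern by substring occurrences in the snippets is an assumption (under it, every occurrence of 有限公司 also counts toward 有限).

```python
def pattern_frequencies(patterns, snippets):
    """Count substring occurrences of each Chinese pattern in the result snippets."""
    return {p: sum(s.count(p) for s in snippets) for p in patterns}

def select_expansion(freq):
    """Return the pattern with the highest frequency, or None if none occurs."""
    if not freq or max(freq.values()) == 0:
        return None
    return max(freq, key=freq.get)

# Frequencies from the example in the text.
chosen = select_expansion({"九江": 15, "食品": 5, "有限": 10, "有限公司": 8})

# Counting from invented snippets, for illustration only.
toy_freq = pattern_frequencies(["九江", "食品", "有限", "有限公司"],
                               ["九江匯源食品", "九江 食品飲料有限公司", "九江"])
print(chosen, toy_freq)
```

With the frequencies from the text, 九江 is chosen as the query expansion.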

From the document: Web-based English-to-Chinese Proper Noun Translation Extraction (pages 23-28)