

Chapter 3 Web-based Term Translation Combining Naming Rules

3.1 Named Entity Recognition

We first retrieve search results in the source language and use them for NER. In the NER phase, because a term may belong to multiple categories, it is not appropriate to restrict a named entity to a single category; for example, novels are often made into films. To allow a named entity to belong to multiple categories, we trained a Boolean identifier for each category and identified named entities with these identifiers.
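The multi-label scheme above can be sketched as running every category's Boolean identifier independently and keeping each category that fires. The identifiers here are stand-in callables over a hypothetical feature dictionary; the thesis's actual identifiers are trained classifiers over the features described below.

```python
def classify(term_features, identifiers):
    """Run one Boolean identifier per category and return every category
    whose identifier accepts the term, so a named entity (e.g. a novel
    that was also made into a film) may belong to several categories."""
    return [c for c, accept in identifiers.items() if accept(term_features)]

# Illustrative stand-in identifiers (thresholds are invented for the example).
identifiers = {
    "book": lambda f: f["POS"] > 0.5,
    "movie": lambda f: f["Tpos"] < 0.3,
    "company": lambda f: f["Tpos"] > 0.7,
}
```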


To identify an unknown named entity, we first submit the term to Google1 and retrieve the top 100 results in the source language (English). After the results are collected, HTML tags are stripped and the snippets are part-of-speech (POS) tagged with the Natural Language Toolkit (NLTK2). The cleaned, tagged results are then passed to the category identifiers.
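The cleaning and tagging steps can be sketched as follows. The HTML-stripping regex is an assumption (the thesis does not specify how tags are removed); the tagging function uses NLTK's standard `word_tokenize`/`pos_tag` API, which requires the relevant NLTK data packages to be downloaded.

```python
import re

def strip_html(snippet):
    """Replace HTML tags with spaces and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", snippet)
    return re.sub(r"\s+", " ", text).strip()

def pos_tag_snippet(snippet):
    """POS-tag a cleaned snippet with NLTK (needs the 'punkt' tokenizer
    and 'averaged_perceptron_tagger' resources)."""
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(strip_html(snippet)))
```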

The features used by the identifiers fall into three types: syntactic features, word usage of intra-category (WIC) features, and word distribution among inter-categories (WAC) features. We describe each type in the following sections.

3.1.1 Syntactic Features

Syntactic features exploit syntactic information of retrieved results, namely, part of speech (POS) tag patterns, relative position in titles, and relative position in snippets.

The three features are listed in Table 1.

Table 1. Syntactic features focus on syntactic information of the results

Syntactic
POS(t)   ∑_{pat∈p} p(pat|c), where p is the set of POS tag patterns that co-occur within 3 words of the named entity and c is the category (book, movie, medicine, company)
Tpos(t)  Relative position of the term in the title: 0 indicates the start of the title and 1 indicates the end
Spos(t)  Relative position of the term in the snippet: 0 indicates the start of the snippet and 1 indicates the end

To compute the value of the POS feature, we summed the probabilities of the POS tag patterns occurring within three words of the named entity. These probabilities were estimated from the collected results of the training instances; the probability that a POS pattern belongs to category c was computed by the following equation:

1 http://www.google.com

2 http://nltk.org/


p(patᵢ|c) = frequency(patᵢ, c) / ∑_{patⱼ∈p} frequency(patⱼ, c)    (1)

where frequency(pat, c) is the number of times pattern pat appears in the results of category c, and p is the set of all patterns appearing in category c.

The Tpos and Spos features represent the relative position at which the named entity appears in search-result titles and snippets, respectively. We expected these positions to vary across categories; for example, a book or movie title usually occurs at the beginning of a snippet, while a company name usually occurs in the middle. Each occurrence position in a title or snippet was mapped to a value between 0 and 1, where 0 represents the start and 1 the end of the title or snippet, and the feature value is the average of these position values.
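A minimal sketch of Tpos/Spos follows. The thesis does not state the exact normalization, so the word-index-based mapping (first word → 0, last word → 1) and the lowercase whole-word match are assumptions.

```python
def relative_positions(term, texts):
    """Average relative position of `term` across titles or snippets:
    0 = start of the text, 1 = end (the Tpos/Spos features of Table 1)."""
    positions = []
    for text in texts:
        words = text.lower().split()
        if term.lower() in words:
            idx = words.index(term.lower())
            # Map the word index into [0, 1]; a one-word text maps to 0.
            positions.append(idx / (len(words) - 1) if len(words) > 1 else 0.0)
    return sum(positions) / len(positions) if positions else 0.0
```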

3.1.2 Word usage of Intra-category

Because frequently used words vary between categories, we exploited word-usage information from the retrieved results. Since the usage of verbs and adjectives also varies across categories, we proposed five features: usage of all words in the results, usage of all verbs, usage of verbs around the named entity, usage of all adjectives, and usage of adjectives around the named entity. The features are listed in Table 2.

Table 2. Features considering word usage inside a category

Word usage of intra-category
AWIC(t)  Usage of all words in the retrieved results
AVIC(t)  Usage of all verbs in the retrieved results
AJIC(t)  Usage of all adjectives in the retrieved results
CVIC(t)  Usage of verbs within three words of the named entity
CJIC(t)  Usage of adjectives within three words of the named entity


The AWIC feature considers the word distribution of a term. The weight of each extracted word was computed by the following equation:

AWIC(t) = ∑_{xᵢ∈w} [ tf(xᵢ) / ∑_{xⱼ∈w} tf(xⱼ) ] × prob(xᵢ|c)    (2)

where tf(xᵢ) is the term frequency of xᵢ in the retrieved search results, w is the set of English words appearing in the search results excluding stop words, and prob(xᵢ|c) is the proportion of word xᵢ in category c, estimated by equation (1) from the search results of the training instances.

Feature AVIC considers verb usage across the returned results, while feature CVIC considers only verbs appearing within a window of three words around the named entity. Similarly, feature AJIC considers the usage of all adjectives in the returned results, and feature CJIC considers adjectives appearing within the three-word window around the named entity.
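Extracting the windowed verbs or adjectives for CVIC/CJIC can be sketched over a POS-tagged snippet, using the Penn Treebank convention that verb tags start with "VB" and adjective tags with "JJ". The single-token entity index is a simplifying assumption; multi-word entities would need a span instead.

```python
def context_words(tagged, entity_index, tag_prefix, window=3):
    """Collect words whose POS tag starts with `tag_prefix` ('VB' for
    verbs, 'JJ' for adjectives) within `window` words of the entity."""
    lo = max(0, entity_index - window)
    hi = min(len(tagged), entity_index + window + 1)
    return [word for word, tag in tagged[lo:hi] if tag.startswith(tag_prefix)]
```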

3.1.3 Word Distribution among Inter-categories

These features aimed to complement the word usage of intra-category (WIC) features. The disadvantage of the WIC features is that they do not account for common words. For example, "do" is frequently used in all categories; since WIC features only compute the proportion of a word within a category, common words receive high values in every category, which confuses identification. To address this problem, we used word distribution among inter-categories (WAC) features to complement the WIC features.

The WAC features consider the distribution of a word among categories, so a common word receives a low value. The features are listed in Table 3.


Table 3. Features considering word distribution among categories

Word distribution among inter-categories
AWAC(t)  Word distribution probabilities among categories
AVAC(t)  Distribution probabilities of verbs among categories
AJAC(t)  Distribution probabilities of adjectives among categories
CVAC(t)  Distribution probabilities of verbs within 3 words of the named entity
CJAC(t)  Distribution probabilities of adjectives within 3 words of the named entity

The value of each feature was computed as in equation (3):

value = ∑_{xᵢ∈w} [ tf(xᵢ) / ∑_{xⱼ∈w} tf(xⱼ) ] × p(c|xᵢ)    (3)

where p(c|xᵢ) is the proportion of the occurrences of xᵢ that fall in category c.
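Equation (3) as extracted is only partially legible; a sketch of one consistent reading, in which each word's relative frequency is weighted by the share of that word's occurrences falling in category c (an assumption, not confirmed by the thesis), follows. A word spread evenly over all categories gets a low share everywhere, matching the stated goal of down-weighting common words.

```python
from collections import Counter

def category_share(word_freq_by_category):
    """For each word, the share of its total occurrences in each category;
    an evenly distributed common word gets a low share everywhere."""
    totals = Counter()
    for counts in word_freq_by_category.values():
        totals.update(counts)
    return {c: {w: f / totals[w] for w, f in counts.items()}
            for c, counts in word_freq_by_category.items()}

def wac(result_words, share, category):
    """One reading of equation (3): relative frequency times category share."""
    tf = Counter(result_words)
    total = sum(tf.values())
    if total == 0:
        return 0.0
    return sum(tf[w] / total * share.get(category, {}).get(w, 0.0) for w in tf)
```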

AVAC is similar to AWAC, except that it considers the distribution of verbs among categories rather than of all English words. To compute the value of feature AVAC, we extracted the words tagged as verbs from the returned results; the value was then computed by equation (3).

AJAC focuses on adjectives appearing in the search results; we consider adjective usage an important cue for judging categories. CVAC and CJAC focus on verbs and adjectives within a window of three words around the named entity. Since verbs or adjectives near the named entity may be modifiers of, or actions involving, the entity, they also provide information for categorization.
