
Chapter 4 Named Entity Extraction and its Results

4.1. Rule-based Named Entity Extraction

Protein entities in the training set are composed of core terms, function terms, and predefined terms.

Core terms bear the closest resemblance to regular proper names. Function terms describe the functions or characteristics of a protein. Predefined terms fall into three types, namely specifier, amino acid, and unit (Table 4-1).

The most frequent regular expressions of protein entities are ‘C⁺’ and ‘F⁺C⁺’, where ‘C’ is a core term and ‘F’ is a function term. In SRC and GENIA, the most frequent pattern ‘C⁺’ accounts for 64.90% and 58.36% respectively, and ‘F⁺C⁺’ accounts for 10.99% and 5.15%.
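The pattern notation can be illustrated with a minimal sketch. It assumes every token of an entity has already been tagged as core (‘C’) or function (‘F’); the helper name `entity_pattern` is hypothetical and is not part of the original system.

```python
from itertools import groupby


def entity_pattern(tags):
    """Collapse a per-token tag sequence into a pattern such as 'F+C+'."""
    # groupby merges each run of identical tags into one symbol.
    return "".join(f"{tag}+" for tag, _ in groupby(tags))


# 'transcription factor Sp1', assuming both tokens of 'transcription factor'
# are function terms and 'Sp1' is a core term.
print(entity_pattern(["F", "F", "C"]))  # F+C+
print(entity_pattern(["C", "C"]))       # C+
```

Counting how often each collapsed pattern occurs over all annotated entities would give frequency figures of the kind reported above.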

Types        Description                                      Example
Specifier    Numerical characters combined with a letter      1a, 1b, 1c
             Number                                           1, 2, 3
             Greek letters                                    alpha, beta, gamma
Amino Acid   The 20 amino acids found in proteins             Gly, Ala, Val
Unit         Units                                            microM, %, UV

Table 4-1: The three predefined terms.

We define two types of function terms, head function terms and tail function terms, according to the position in which they appear. In our observation, 58.48% of head function terms appear before a token with an initial uppercase letter. For example, ‘transcription factor E2F-1’, ‘transcription factor GATA-4’ and ‘transcription factor Sp1’ are three protein entities. Similarly, 74.07% of tail function terms appear after a token with an initial uppercase letter or after a specifier, as in ‘ACC deaminase’, ‘ACC oxidase’, ‘ACC synthase’ and ‘colicin A immunity protein’, where the shaded terms are function terms. In total we obtain 217 distinct head function terms and 127 distinct tail function terms.

All tokens not explicitly classified are defined as core terms. Within a core term, a run of consecutive English letters is treated as a common string, and these strings are useful for identifying unknown words. For example, the common string ‘CD’ is acquired from the core term ‘CD23’, so the unknown word ‘CD25’ is treated as a core term because its common string is also ‘CD’. In our token list, 3,422 core terms recover 627 unknown tokens.
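The common-string idea can be sketched as follows. As a simplifying assumption, only the leading run of English letters of a token is taken as its common string; the function names and the tiny core-term list are illustrative only.

```python
import re


def common_string(token):
    """Return the leading run of English letters of a token, or None."""
    match = re.match(r"[A-Za-z]+", token)
    return match.group(0) if match else None


def build_common_strings(core_terms):
    """Collect the common strings of all known core terms."""
    return {s for s in map(common_string, core_terms) if s is not None}


def is_core_by_common_string(token, common_strings):
    """An unknown token counts as a core term if its common string is known."""
    return common_string(token) in common_strings


known = build_common_strings(["CD23", "IL-2"])   # hypothetical core terms
print(is_core_by_common_string("CD25", known))   # True
print(is_core_by_common_string("major", known))  # False
```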

Extraction proceeds in six steps. The first three steps use predefined terms, core terms, and function terms to produce candidates. A token that matches one of these terms is annotated, and a run of adjacent annotated tokens is treated as a candidate, or chunk. Candidates are then confirmed or trimmed by steps 1 to 3.
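Candidate generation can be sketched as grouping runs of annotated tokens. The predicate `is_term` stands in for the lookup against predefined, core, and function terms; it and the small term set are assumptions made for illustration.

```python
from itertools import groupby


def candidate_chunks(tokens, is_term):
    """Return each maximal run of adjacent annotated tokens as one chunk."""
    chunks = []
    for annotated, run in groupby(tokens, key=is_term):
        if annotated:
            chunks.append(list(run))
    return chunks


terms = {"transcription", "factor", "Sp1", "IL-2"}  # hypothetical term list
sentence = ["the", "transcription", "factor", "Sp1", "binds", "IL-2"]
print(candidate_chunks(sentence, lambda t: t in terms))
# [['transcription', 'factor', 'Sp1'], ['IL-2']]
```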

Step 1: boundary confirmation

Some POS tags cannot appear on the boundary of an entity. We scan the chunk forward (left to right) and backward (right to left) to fix the boundary. Parentheses must occur in pairs, so unmatched parentheses are removed. For example, the chunk ‘1 ) IL-2’ has to be split into ‘1’ and ‘IL-2’. The procedure is as follows:

- Remove unmatched parentheses.

- The chunk’s first POS tag must be one of { ‘(’, ‘CD’, ‘JJ’, ‘NN’, ‘NNS’, ‘VBN’ }.

- The chunk’s last POS tag must be one of { ‘)’, ‘CD’, ‘JJ’, ‘NN’, ‘NNS’, ‘VBN’ }.

- Remove the pair of parentheses when it encloses the whole chunk.
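The four rules of Step 1 can be sketched as below. The sketch assumes a chunk is a list of (token, POS) pairs in which a parenthesis token carries itself as its POS tag, and that an unmatched parenthesis splits the chunk at its position, as in the ‘1 ) IL-2’ example. All function names are hypothetical.

```python
FIRST_OK = {"(", "CD", "JJ", "NN", "NNS", "VBN"}
LAST_OK = {")", "CD", "JJ", "NN", "NNS", "VBN"}


def split_on_unmatched(chunk):
    """Drop unmatched parentheses, splitting the chunk where they stood."""
    unmatched, open_stack = set(), []
    for i, (token, _) in enumerate(chunk):
        if token == "(":
            open_stack.append(i)
        elif token == ")":
            if open_stack:
                open_stack.pop()
            else:
                unmatched.add(i)
    unmatched.update(open_stack)  # every '(' that was never closed

    pieces, current = [], []
    for i, item in enumerate(chunk):
        if i in unmatched:
            if current:
                pieces.append(current)
            current = []
        else:
            current.append(item)
    if current:
        pieces.append(current)
    return pieces


def is_enclosed(chunk):
    """True if the first '(' and the last ')' pair up around the whole chunk."""
    if len(chunk) < 2 or chunk[0][0] != "(" or chunk[-1][0] != ")":
        return False
    depth = 0
    for i, (token, _) in enumerate(chunk):
        depth += (token == "(") - (token == ")")
        if depth == 0 and i < len(chunk) - 1:
            return False  # the opening parenthesis closed before the end
    return True


def confirm_boundaries(chunk):
    """Apply the Step 1 rules and return the resulting list of chunks."""
    results = []
    for piece in split_on_unmatched(chunk):
        start, end = 0, len(piece)
        while start < end and piece[start][1] not in FIRST_OK:
            start += 1  # forward scan
        while end > start and piece[end - 1][1] not in LAST_OK:
            end -= 1    # backward scan
        piece = piece[start:end]
        if is_enclosed(piece):
            piece = piece[1:-1]
        if piece:
            results.append(piece)
    return results


print(confirm_boundaries([("1", "CD"), (")", ")"), ("IL-2", "NN")]))
# [[('1', 'CD')], [('IL-2', 'NN')]]
```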

Step 2: remove invalid single-token chunks

The following conditions are used to check whether a single-token chunk is valid. If either condition is matched, the chunk is regarded as invalid and removed:

- All characters of the token are lower case, and the token is not a protein entity in the training data. For example, the token ‘major’ is meaningless.

- The token is a predefined term. For example, the token ‘12’ cannot represent an entity.
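A minimal sketch of the two Step 2 conditions follows. The set `training_entities` and the predicate `is_predefined` are placeholders for resources that the system derives from the training set and Table 4-1; both are assumptions here.

```python
def is_invalid_single_token(token, training_entities, is_predefined):
    """Return True if a single-token chunk should be removed."""
    # str.islower() ignores digits and hyphens, so 'il-2' counts as lower case.
    lowercase_unknown = token.islower() and token not in training_entities
    return lowercase_unknown or is_predefined(token)


def toy_predefined(token):
    """Hypothetical stand-in for the predefined-term lookup."""
    return token.isdigit() or token in {"alpha", "beta", "gamma", "Gly", "%"}


entities = {"actin"}  # hypothetical single-token entities seen in training
for tok in ["major", "12", "IL-2", "actin"]:
    print(tok, is_invalid_single_token(tok, entities, toy_predefined))
# major True / 12 True / IL-2 False / actin False
```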

Step 3: remove invalid multi-token chunks

Removing invalid multi-token chunks needs more evidence. We propose domain-independent rules to filter the chunks. A chunk is removed if it consists of the following:

- Predefined terms, such as ‘1’, ‘2’ and ‘3’.

- Single uppercase English letters, such as ‘A’, ‘B’ and ‘C’.

- Punctuation marks, such as ‘,’, ‘(’ and ‘)’.

- Conjunctions, such as ‘and’ and ‘or’.
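The Step 3 filter can be sketched as follows, under the assumption that a chunk is removed only when every one of its tokens falls into one of the four listed classes. The predicate `is_predefined` and the conjunction list are illustrative.

```python
import string

CONJUNCTIONS = {"and", "or"}


def is_filler(token, is_predefined):
    """True if the token belongs to one of the four non-informative classes."""
    single_upper = len(token) == 1 and token.isascii() and token.isupper()
    punctuation = bool(token) and all(ch in string.punctuation for ch in token)
    return (
        is_predefined(token)
        or single_upper
        or punctuation
        or token.lower() in CONJUNCTIONS
    )


def is_invalid_multi_token(chunk, is_predefined):
    """Return True if the whole chunk is made of filler tokens."""
    return all(is_filler(token, is_predefined) for token in chunk)


print(is_invalid_multi_token(["1", ",", "2", "and", "3"], str.isdigit))       # True
print(is_invalid_multi_token(["colicin", "A", "immunity", "protein"], str.isdigit))  # False
```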

After these three steps, 68.21% and 52.63% of the invalid tokens are removed in SRC and GENIA respectively (Table 4-2).

                        Remove #
Corpus   Invalid #   Total    Corr.    Corr. Rate   Filter Rate
SRC      13,451      9,175    9,045    98.58%       68.21%
GENIA    8,846       4,656    4,513    96.93%       52.63%

Table 4-2: The effect of term-based stage (step 1 to 3).
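The two rates in Table 4-2 can be re-derived from its counts. The sketch assumes that the three counts per corpus are the number of invalid tokens, the number of removed tokens, and the number of correctly removed tokens; the function name is illustrative.

```python
def term_stage_rates(invalid, removed, removed_correct):
    """Return (correct rate, filter rate) as percentages with two decimals."""
    correct_rate = 100 * removed_correct / removed
    filter_rate = 100 * removed / invalid
    return round(correct_rate, 2), round(filter_rate, 2)


print(term_stage_rates(13451, 9175, 9045))  # SRC:   (98.58, 68.21)
print(term_stage_rates(8846, 4656, 4513))   # GENIA: (96.93, 52.63)
```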

The last three steps aim to acquire as many precise protein entities as possible, so three pattern rules are proposed. Step 4 mines the tokens in the positions preceding and following a protein entity. Step 5 filters out some candidates to boost precision. Step 6 employs syntactic rules to discover further protein names. The rules are generated from statistical information drawn from the training set.

Step 4: mine the tokens surrounding protein entities

The pattern is formulated as ‘<T₋₂, T₋₁, #, T₁, T₂>’, where ‘#’ is the number of tokens in the protein entity and ‘Tᵢ’ is the i-th token relative to the protein entity.

Two measurements, confidence and occurrence, are used to judge the usefulness of a pattern. Occurrence is the number of all instances of the pattern in the training data, and confidence is the number of correct instances divided by the occurrence. A pattern is selected when its occurrence is greater than one and its confidence is greater than 0.8, because our system is expected to achieve an 80% correct rate, that is, the number of correct instances divided by the number of all retrieved instances.

T₋₂        T₋₁   #   T₁   T₂   Confidence   Corr. Inst.   Occurrence
receptor   (     1   )    .    0.95         20            21
protein    (     1   )    ,    0.94         16            17
factor     (     1   )    ,    0.90         18            20
protein    (     1   )    .    0.89         8             9

Table 4-3: The examples of the patterns ‘<T₋₂, T₋₁, #, T₁, T₂>’.
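Pattern selection by confidence and occurrence can be sketched as below. The input format, a stream of (pattern, is_correct) pairs collected from the training data, and the function name are assumptions made for illustration.

```python
from collections import Counter


def select_patterns(instances, min_confidence=0.8, min_occurrence=1):
    """Keep patterns whose occurrence and confidence exceed the thresholds.

    instances: iterable of (pattern, is_correct) pairs.
    Returns {pattern: confidence}.
    """
    occurrence, correct = Counter(), Counter()
    for pattern, is_correct in instances:
        occurrence[pattern] += 1
        correct[pattern] += bool(is_correct)

    selected = {}
    for pattern, count in occurrence.items():
        confidence = correct[pattern] / count
        if count > min_occurrence and confidence > min_confidence:
            selected[pattern] = confidence
    return selected


# First row of Table 4-3: 20 correct instances out of 21 occurrences.
receptor = ("receptor", "(", 1, ")", ".")
rare = ("kinase", "(", 1, ")", ",")  # hypothetical pattern seen only once
data = [(receptor, True)] * 20 + [(receptor, False)] + [(rare, True)]
print(select_patterns(data))  # only the 'receptor' pattern, confidence ~0.95
```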

Step 5: mine the bag of words surrounding protein entities

We collect the two preceding tokens and the two following tokens around a protein entity as a bag of four words. Non-confidence, the number of negative instances divided by the number of all instances, is used to filter candidates. A pattern is recognized when its non-confidence is greater than 0.8, because our system is expected to yield an 80% correct rate. Table 4-4 gives some examples. Note that a candidate whose surrounding bag of words has high non-confidence should be removed.

4 bag-of-word        Non-Conf.   Neg. Inst.   Occurrence
of the the           0.91        10           11
in of region the     0.89        8            9
, , cells in         0.80        4            5
. site the to        0.80        4            5

Table 4-4: The examples of the 4 bag-of-word collected from the surroundings of protein entities.
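The bag-of-words filter can be sketched as follows. The sketch assumes the bag ignores token order, so it is stored as a sorted tuple, and it uses the strict "greater than 0.8" comparison stated in the text. The function names and the example sentence are hypothetical.

```python
from collections import Counter


def context_bag(tokens, start, end):
    """Order-free key for the two tokens before and after tokens[start:end]."""
    context = tokens[max(0, start - 2):start] + tokens[end:end + 2]
    return tuple(sorted(context))


def noisy_bags(examples, threshold=0.8):
    """Return {bag: non_confidence} for bags above the threshold.

    examples: iterable of (bag, is_protein) pairs from the training data.
    """
    occurrence, negative = Counter(), Counter()
    for bag, is_protein in examples:
        occurrence[bag] += 1
        negative[bag] += not is_protein
    return {
        bag: negative[bag] / count
        for bag, count in occurrence.items()
        if negative[bag] / count > threshold
    }


tokens = "the region of IL-2 in the cell".split()
bag = context_bag(tokens, 3, 4)  # context of the candidate 'IL-2'
print(bag)                       # ('in', 'of', 'region', 'the')

# Second row of Table 4-4: 8 negative instances out of 9 occurrences.
training = [(bag, False)] * 8 + [(bag, True)]
print(bag in noisy_bags(training))  # True, so the candidate would be removed
```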

Step 6: employ syntactic rules

A hypernym may appear in front of its hyponyms [Hearst, ‘92], and the most common pattern is ‘NP₀ such as {NP₁, NP₂, … , (and|or)} NPₙ’. We focus on ‘such as’ and ‘e.g.’ and on their preceding tokens, which provide important clues, as in ‘… proteins, such as CBL and VAV, were phosphorylated on …’. First we search for ‘such as’; the preceding token ‘proteins’ then tells us that what follows is a list of protein names. Therefore we can identify the protein names ‘CBL’ and ‘VAV’. In addition to the clue token ‘proteins’, we collect the other clue tokens that appear in this position in the training set (Table 4-5).

activated     factors        proteases
activation    kinases        protein
cytokines     lymphocytes    proteins
effectors     lymphokines    receptor
enzymes       mediators      receptors
eosinophils   molecule       stimulus
isoforms      molecules      transcription factors
              oncoproteins

Table 4-5: The list of significant clue tokens.
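A minimal sketch of the ‘such as’ rule follows. It handles only ‘such as’ (not ‘e.g.’) and single-token names, and it ends the list at the first token that neither starts with an uppercase letter nor contains a digit. That stopping heuristic, the function name, and the reduced clue list are assumptions, not the original implementation.

```python
CLUE_TOKENS = {"proteins", "factors", "kinases", "receptors", "enzymes"}
SEPARATORS = {",", "and", "or"}


def names_after_such_as(tokens, clue_tokens=CLUE_TOKENS):
    """Collect list items that follow '<clue token> [,] such as'."""
    names = []
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() != "such" or tokens[i + 1].lower() != "as":
            continue
        j = i - 1
        while j >= 0 and tokens[j] == ",":
            j -= 1  # skip the comma between the clue token and 'such as'
        if j < 0 or tokens[j].lower() not in clue_tokens:
            continue
        k = i + 2
        while k < len(tokens):
            token = tokens[k]
            if token in SEPARATORS:
                k += 1
            elif token[0].isupper() or any(ch.isdigit() for ch in token):
                names.append(token)
                k += 1
            else:
                break  # first ordinary word ends the list
    return names


sentence = ["proteins", ",", "such", "as", "CBL", "and", "VAV", ",",
            "were", "phosphorylated", "on"]
print(names_after_such_as(sentence))  # ['CBL', 'VAV']
```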

Model performance is evaluated in terms of precision (P), recall (R) and F-score (F), where F = 2PR/(P+R). To present the performance of rule-based systems, we use the notations of correct matching defined in [Olsson et al., ‘02]:

Sloppy: Any proposed token matches some token of the answer key. For example, ‘CD28’ vs. ‘CD28 surface receptor’.

Protein Name Parts (PNP): Each proposed token matches any token of the answer key. For example, ‘activation of the CD28 surface receptor’ vs. ‘CD surface receptor’.

Strict: The proposed hit matches one answer key exactly. For example, ‘IL-2’ vs. ‘IL-2’.

Boundary:

Left: The leftmost proposed token matches a left boundary in the answer key. For example, ‘CD28’ vs. ‘CD28 surface receptor’.

Right: The rightmost proposed token matches a right boundary in the answer key. For example, ‘activation of the CD28 surface receptor’ vs. ‘CD28 surface receptor’.

Left or Right (LorR): One of the boundaries matches that of the answer key. This notation is the union of Left and Right.
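The sloppy, strict and boundary criteria can be sketched with token offsets. The sketch assumes both the proposed hit and the answer key are half-open (start, end) token spans in the same sentence; this representation and the function names are illustrative, and PNP is not covered.

```python
def sloppy(proposed, key):
    """Spans overlap in at least one token."""
    return proposed[0] < key[1] and key[0] < proposed[1]


def strict(proposed, key):
    """Spans are identical."""
    return proposed == key


def left(proposed, key):
    """Left boundaries coincide."""
    return proposed[0] == key[0]


def right(proposed, key):
    """Right boundaries coincide."""
    return proposed[1] == key[1]


def left_or_right(proposed, key):
    """Union of the Left and Right criteria."""
    return left(proposed, key) or right(proposed, key)


# Tokens: activation(0) of(1) the(2) CD28(3) surface(4) receptor(5)
key = (3, 6)          # 'CD28 surface receptor'
short_hit = (3, 4)    # 'CD28'
long_hit = (0, 6)     # 'activation of the CD28 surface receptor'

print(sloppy(short_hit, key), left(short_hit, key), strict(short_hit, key))
# True True False
print(right(long_hit, key), left(long_hit, key), left_or_right(long_hit, key))
# True False True
```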

Table 4-6 shows that the strict measure yields an F-score of roughly 51% to 52% (51.82% and 50.70%). It also suggests that the terms, which come from SRC, are adaptable to another corpus, because performance on SRC and GENIA is almost the same. Table 4-7 shows that the improvement is clear after steps 1 to 3 (the F-score rises from 29.91% to 50.09% and from 38.02% to 49.85%), whereas steps 4 to 6 have little effect. Across the steps, precision is boosted considerably while recall changes little.

SRC
Notation   tp + fn   tp + fp   tp      Recall   Precision   F-Score
SLOPPY     3,234     4,782     2,987   92.36%   62.46%      74.53%
PNP        3,234     4,782     2,859   88.40%   59.79%      71.33%
STRICT     3,234     4,782     2,077   64.22%   43.43%      51.82%
LEFT       3,234     4,782     2,620   81.01%   54.79%      65.37%
RIGHT      3,234     4,782     2,363   73.07%   49.41%      58.96%
LorR       3,234     4,782     2,907   89.89%   60.79%      72.53%

GENIA
Notation   tp + fn   tp + fp   tp      Recall   Precision   F-Score
SLOPPY     3,451     4,923     3,010   87.22%   61.14%      71.89%
PNP        3,451     4,923     2,837   82.21%   57.63%      67.76%
STRICT     3,451     4,923     2,123   61.52%   43.12%      50.70%
LEFT       3,451     4,923     2,765   80.12%   56.16%      66.04%
RIGHT      3,451     4,923     2,296   66.53%   46.64%      54.84%
LorR       3,451     4,923     2,938   85.13%   59.68%      70.17%

Table 4-6: The rule-based result in SRC and GENIA.
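The percentages in Table 4-6 follow directly from the three counts in each row. The sketch below recomputes recall, precision and F-score from tp, tp + fn and tp + fp; the function name is illustrative.

```python
def precision_recall_f(tp, gold_total, proposed_total):
    """Return (recall, precision, F-score) as percentages with two decimals.

    gold_total is tp + fn and proposed_total is tp + fp.
    """
    recall = tp / gold_total if gold_total else 0.0
    precision = tp / proposed_total if proposed_total else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 2) for value in (recall, precision, f_score))


# STRICT row with tp + fn = 3,234, tp + fp = 4,782 and tp = 2,077.
print(precision_recall_f(2077, 3234, 4782))  # (64.22, 43.43, 51.82)
```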

SRC
Procedure   tp + fn   tp + fp   tp      Recall   Precision   F-Score
Step 1      3,234     10,480    2,051   63.42%   19.57%      29.91%
Step 2      3,234     5,493     2,043   63.17%   37.19%      46.82%
Step 3      3,234     4,911     2,040   63.08%   41.54%      50.09%
Step 4      3,234     4,977     2,104   65.06%   42.27%      51.25%
Step 5      3,234     4,781     2,077   64.22%   43.44%      51.83%
Step 6      3,234     4,782     2,077   64.22%   43.43%      51.82%

GENIA
Procedure   tp + fn   tp + fp   tp      Recall   Precision   F-Score
Step 1      3,451     7,911     2,160   62.59%   27.30%      38.02%
Step 2      3,451     5,173     2,129   61.69%   41.16%      49.37%
Step 3      3,451     5,082     2,127   61.63%   41.85%      49.85%
Step 4      3,451     5,164     2,155   62.45%   41.73%      50.03%
Step 5      3,451     4,915     2,120   61.43%   43.13%      50.68%
Step 6      3,451     4,923     2,123   61.52%   43.12%      50.70%

Table 4-7: The intermediate results of rule-based approach.

4.2. Statistical and Hybrid Named Entity
