Noisy Filtering - Automatic Recognition of Protein Names in Biological Corpora

Chapter 3 Preprocessing

3.4. Noisy Filtering

Intuitively, protein entities do not appear in citations, web links, section titles, or abstract-truncation messages, so we use heuristic rules to filter out this noise. To identify citations, every span enclosed in a pair of parentheses or square brackets is treated as a candidate. A candidate is tagged as a citation if it satisfies one of the following rules:

Ø The candidate contains ‘et. al’.

Ø The candidate contains ‘( n )’, where n is a year between 1900 and 2009.

Ø The candidate contains author names, such as ‘Krupinski, J.’, ‘J. Krupinski’, ‘Bakalyar, H. A.’ and ‘H. A. Bakalyar’.

We recognize a web link when the token ‘http’ is found and the next token is ‘:’. Moreover, the token that follows must consist only of ‘/’, ‘-’, ‘.’ and letters. For instance, the three tokens ‘http’, ‘:’, and ‘//www-genome.wi.mit.edu/’ form a web link.

Section titles appear at the beginning of a sentence, and their tokens consist entirely of capital letters. In addition, the last token of a section title is ‘:’; an example is ‘FRAGEMENT :’.

Some abstracts are truncated in MEDLINE, and a message is inserted in their place. The message looks like ‘( ABSTRACT TRUNCATED AT 250 WORDS )’, where ‘250’ may be another number.

After these processes, we filter out 1.55% and 0.42% of the tokens in SRC and GENIA respectively (Table 3-9).

Noisy Type                    SRC Tagged Tokens   SRC Percentage   GENIA Tagged Tokens   GENIA Percentage
Citation                      10,648              1.44%            1,434                 0.29%
Web Link                      3                   0.00%            6                     0.00%
Section Title                 403                 0.05%            454                   0.09%
Abstract Truncated Message    413                 0.06%            189                   0.04%

Table 3-9: The number of tokens in the defined noisy regions. (There are 740,001 and 490,469 tokens in SRC and GENIA respectively.)
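The filtering heuristics of this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the original implementation: the regular expressions for author names, the extra ‘et al’ spelling, and all function names are assumptions introduced here.

```python
import re

# "( n )" with a four-digit n; the 1900-2009 range is checked in code.
YEAR = re.compile(r"\(\s*(\d{4})\s*\)")
# Assumed author-name shapes: 'Krupinski, J.' and 'H. A. Bakalyar'.
SURNAME_FIRST = re.compile(r"\b[A-Z][a-z]+\s*,\s*(?:[A-Z]\.\s*)+")
INITIALS_FIRST = re.compile(r"\b(?:[A-Z]\.\s*)+[A-Z][a-z]+\b")
# A link body may only contain '/', '-', '.' and letters.
LINK_BODY = re.compile(r"[A-Za-z/.\-]+")
TRUNCATED = re.compile(r"\(\s*ABSTRACT TRUNCATED AT \d+ WORDS\s*\)")


def is_citation(candidate: str) -> bool:
    """Apply the three citation rules to one bracketed candidate."""
    # 'et al' without the first period is an assumed extra spelling.
    if "et. al" in candidate or "et al" in candidate:
        return True
    year = YEAR.search(candidate)
    if year and 1900 <= int(year.group(1)) <= 2009:
        return True
    return bool(SURNAME_FIRST.search(candidate)
                or INITIALS_FIRST.search(candidate))


def is_web_link(tokens, i):
    """True if tokens[i:i+3] look like 'http', ':', '<link body>'."""
    return (i + 2 < len(tokens)
            and tokens[i] == "http"
            and tokens[i + 1] == ":"
            and LINK_BODY.fullmatch(tokens[i + 2]) is not None)


def section_title_length(tokens):
    """Number of sentence-initial tokens forming a section title, or 0."""
    for i, tok in enumerate(tokens):
        if tok == ":":
            return i + 1 if i > 0 else 0
        if not (tok.isalpha() and tok.isupper()):
            return 0
    return 0


def is_truncation_message(text: str) -> bool:
    return TRUNCATED.search(text) is not None
```

Each function returns a decision for one candidate, so the tagged tokens can be counted per noise type in the way Table 3-9 reports them.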

Chapter 4 Named Entity Extraction and its Results

We run our named entity extractors on Microsoft Windows 2000 Professional, and the database system is Microsoft SQL Server 2000 Personal Edition. We make comparisons on two corpora, SRC and GENIA 3.02p, and each corpus is divided into a training set (90%) and a testing set (10%).

4.1. Rule-based Named Entity Extraction

Protein entities in the training set are composed of core, function and predefined terms.

Core terms bear the closest resemblance to regular proper names. Function terms describe the functions or characteristics of a protein. Table 4-1 shows the predefined terms, namely specifiers, amino acids and units.

The most frequent regular expressions of protein entities are ‘C⁺’ and ‘F⁺C⁺’, where ‘C’ is a core term and ‘F’ is a function term. In SRC and GENIA, the most frequent pattern ‘C⁺’ accounts for 64.90% and 58.36% of the entities respectively, and ‘F⁺C⁺’ accounts for 10.99% and 5.15%.

Types        Description                                         Example
Specifier    Combination of numerical characters and a letter    1a, 1b, 1c
Specifier    Number                                              1, 2, 3
Specifier    Greek letters                                       alpha, beta, gamma
Amino Acid   The 20 amino acids found in proteins                Gly, Ala, Val
Unit         Units                                               microM, %, UV

Table 4-1: The three predefined terms.

We define two types of function terms, head function terms and tail function terms, depending on the position in which they appear. In our observation, 58.48% of head function terms appear before a token with an initial uppercase letter. For example, ‘transcription factor E2F-1’, ‘transcription factor GATA-4’ and ‘transcription factor Sp1’ are three protein entities. Similarly, 74.07% of tail function terms appear after a token with an initial uppercase letter or after a specifier. Examples are ‘ACC deaminase’, ‘ACC oxidase’, ‘ACC synthase’ and ‘colicin A immunity protein’, in which the function terms follow the uppercase token or the specifier. In total we obtain 217 distinct head function terms and 127 distinct tail function terms.

All terms that are not explicitly classified are defined as core terms. Within core terms, runs of consecutive English letters are treated as common strings, and these strings are useful for identifying unknown words. For example, the common string ‘CD’ is acquired from the core term ‘CD23’, and the unknown word ‘CD25’ is then treated as a core term because its common string is also ‘CD’. In our token list, 3,422 core terms can recover 627 unknown tokens.
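A minimal sketch of the common-string idea follows. How a token with several letter runs is compared is not specified in the text, so this sketch assumes that the full tuple of letter runs must match; the function names are hypothetical.

```python
import re


def common_strings(token):
    """Runs of consecutive English letters, e.g. ('CD',) for 'CD23'."""
    return tuple(re.findall(r"[A-Za-z]+", token))


def build_index(core_terms):
    """Collect the common strings of all known core terms."""
    return {common_strings(t) for t in core_terms if common_strings(t)}


def is_core_by_common_string(token, index):
    """Treat an unknown token as a core term if its common strings are known.

    Assumption: the whole tuple of letter runs must match a known core term.
    """
    key = common_strings(token)
    return bool(key) and key in index


index = build_index(["CD23"])
recovered = is_core_by_common_string("CD25", index)
```

With the index built from ‘CD23’, the unknown token ‘CD25’ is recovered, while a token with no letters, such as ‘25’, is not.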

Extraction is done in six steps. The first three steps use predefined terms, core terms and function terms to produce the candidates. If a token is one of these terms, it is annotated. Adjacent annotated tokens are treated as a candidate, or chunk, which is then confirmed or trimmed by steps 1 to 3.

Step 1: boundary confirmation

Some POS tags cannot appear on the boundary of a protein entity. We scan the chunk forward (left to right) and backward (right to left) to fix the boundary. Parentheses must be used in pairs, so unmatched parentheses are removed. For example, the chunk ‘1 ) IL-2’ has to be split into ‘1’ and ‘IL-2’. The procedure is as follows:

Ø Remove unmatched parentheses.

Ø The chunk’s first POS tag must be one of the set { ‘(‘, ‘CD’, ‘JJ’, ‘NN’, ‘NNS’, ‘VBN’ }.

Ø The chunk’s last POS tag must be one of the set { ‘)‘, ‘CD’, ‘JJ’, ‘NN’, ‘NNS’, ‘VBN’ }.

Ø Remove the pair of parentheses when it encloses the whole chunk.
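The boundary confirmation procedure above can be sketched as follows. The sketch assumes that a chunk is a list of (token, POS) pairs, that parentheses carry the POS tags ‘(’ and ‘)’, and that removing an unmatched parenthesis splits the chunk at that position (as in the ‘1 ) IL-2’ example). All names are hypothetical.

```python
FIRST_OK = {"(", "CD", "JJ", "NN", "NNS", "VBN"}
LAST_OK = {")", "CD", "JJ", "NN", "NNS", "VBN"}


def split_on_unmatched(chunk):
    """Split a chunk at every parenthesis that has no partner in the chunk."""
    stack, unmatched = [], set()
    for i, (tok, _) in enumerate(chunk):
        if tok == "(":
            stack.append(i)
        elif tok == ")":
            if stack:
                stack.pop()
            else:
                unmatched.add(i)
    unmatched.update(stack)  # opening parentheses that were never closed
    pieces, current = [], []
    for i, item in enumerate(chunk):
        if i in unmatched:
            if current:
                pieces.append(current)
            current = []
        else:
            current.append(item)
    if current:
        pieces.append(current)
    return pieces


def encloses(piece):
    """True if the first '(' is closed by the very last token."""
    if not piece or piece[0][0] != "(":
        return False
    depth = 0
    for i, (tok, _) in enumerate(piece):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth == 0:
                return i == len(piece) - 1
    return False


def trim(piece):
    """Trim disallowed boundary POS tags and strip an enclosing pair."""
    while True:
        before = len(piece)
        while piece and piece[0][1] not in FIRST_OK:
            piece = piece[1:]
        while piece and piece[-1][1] not in LAST_OK:
            piece = piece[:-1]
        if encloses(piece):
            piece = piece[1:-1]
        if len(piece) == before:
            return piece


def confirm_boundary(chunk):
    """Return the token lists that survive boundary confirmation."""
    pieces = (trim(p) for p in split_on_unmatched(chunk))
    return [[tok for tok, _ in p] for p in pieces if p]
```

For the chunk ‘1 ) IL-2’, `confirm_boundary` returns the two chunks ‘1’ and ‘IL-2’; the single-token chunk ‘1’ is then left for step 2 to reject.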

Step 2: remove invalid single-token chunks

The following conditions are used to check whether single-token chunks are valid or not. If one of the conditions is matched, the chunk will be regarded as invalid and be removed:

Ø The characters of a token are all lower case, and the token is not a protein entity in training data. For example, a token ‘major’ is meaningless.

Ø It is a predefined term. For example, a token ‘12’ cannot represent an entity.

Step 3: remove invalid multi-token chunks

Removing invalid multi-token chunks requires more evidence. We propose domain-independent rules to filter these chunks. A chunk is removed if it is composed of the following:

Ø Predefined terms, such as ‘1’, ‘2’ and ‘3’.

Ø Single uppercase English letters, such as ‘A’, ‘B’ and ‘C’.

Ø Punctuation marks, such as ‘,’, ‘(‘ and ‘)’.

Ø Conjunctions, such as ‘and’ and ‘or’.
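Steps 2 and 3 can be sketched as two small predicates. The sketch assumes that a multi-token chunk is removed only when every one of its tokens belongs to one of the listed classes; the text does not state this explicitly, and the parameter names are hypothetical.

```python
import string

CONJUNCTIONS = {"and", "or"}


def is_valid_single(token, predefined, training_entities):
    """Step 2: reject predefined terms and unseen all-lowercase tokens."""
    if token in predefined:
        return False
    if token.islower() and token not in training_entities:
        return False
    return True


def is_valid_multi(tokens, predefined):
    """Step 3: reject a chunk made up only of 'weak' tokens (assumption)."""
    def weak(tok):
        return (tok in predefined
                or tok in CONJUNCTIONS
                or (len(tok) == 1 and tok.isupper())
                or all(ch in string.punctuation for ch in tok))
    return not all(weak(tok) for tok in tokens)


predefined = {"1", "2", "3", "12", "alpha"}
training_entities = {"insulin"}
```

Under these assumptions ‘major’ and ‘12’ are rejected as single-token chunks, while a chunk such as ‘IL-2 receptor’ is kept because it contains at least one token outside the listed classes.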

After the three steps, we remove 68.21% and 52.63% invalid tokens in SRC and GENIA respectively (Table 4-2).

Corpus   Invalid #   Removed (Total)   Removed (Corr.)   Corr. Rate   Filter Rate
SRC      13,451      9,175             9,045             98.58%       68.21%
GENIA    8,846       4,656             4,513             96.93%       52.63%

Table 4-2: The effect of term-based stage (step 1 to 3).

The last three steps aim to acquire as many precise protein entities as possible, so three pattern rules are proposed. Step 4 mines the tokens in the positions preceding and following a protein entity. Step 5 filters out some candidates to boost precision. Step 6 employs syntactic rules to discover further protein names.

The rules are generated from statistical information obtained from the training set.

Step 4: mine the tokens surrounding protein entities

The pattern is formulated as ‘<T_{-2}, T_{-1}, #, T_1, T_2>’, where ‘#’ is the number of tokens in the protein entity, and ‘T_i’ is the i-th token relative to the protein entity.

Two measurements, confidence and occurrence, are used to judge the usefulness of the patterns. Confidence is the number of correct instances divided by the number of all instances in the training data, and occurrence is the number of all instances in the training data. A pattern is selected when its occurrence is greater than one and its confidence is greater than 0.8, because our system is expected to achieve an 80% correct rate, that is, the number of correct instances divided by the number of all retrieved instances.

T_{-2}     T_{-1}   #   T_1   T_2   Confidence   Corr. Inst.   Occurrence
receptor   (        1   )     .     0.95         20            21
protein    (        1   )     ,     0.94         16            17
factor     (        1   )     ,     0.90         18            20
protein    (        1   )     .     0.89         8             9

Table 4-3: Examples of the patterns ‘<T_{-2}, T_{-1}, #, T_1, T_2>’.
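The selection of step 4 patterns by occurrence and confidence can be sketched as below. The input format, a list of (pattern, is_correct) pairs collected from the training data, is an assumption made for illustration.

```python
from collections import Counter


def select_patterns(instances, min_occurrence=1, min_confidence=0.8):
    """Keep patterns with occurrence > min_occurrence and
    confidence > min_confidence.

    instances: iterable of (pattern, is_correct) pairs, where a pattern is
    a tuple (T_-2, T_-1, #, T_1, T_2).
    Returns {pattern: (confidence, occurrence)}.
    """
    total, correct = Counter(), Counter()
    for pattern, ok in instances:
        total[pattern] += 1
        correct[pattern] += bool(ok)
    return {
        p: (correct[p] / n, n)
        for p, n in total.items()
        if n > min_occurrence and correct[p] / n > min_confidence
    }


# First row of Table 4-3: 20 correct instances out of 21 occurrences.
row = ("receptor", "(", 1, ")", ".")
instances = [(row, True)] * 20 + [(row, False)]
instances += [(("in", "the", 1, "of", "the"), True)]        # occurs only once
instances += [(("of", "a", 2, ",", "and"), True)] * 3
instances += [(("of", "a", 2, ",", "and"), False)]          # confidence 0.75
selected = select_patterns(instances)
```

Only the first pattern survives: its confidence is 20/21, which rounds to the 0.95 reported in Table 4-3, while the other two fail the occurrence and confidence thresholds respectively.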

Step 5: mine the bag-of-word surrounding protein entities

We collect the two preceding tokens and the two following tokens surrounding a protein entity. Non-confidence, the number of negative instances divided by the number of all instances, is used to filter the candidates. A pattern is recognized when its non-confidence is greater than 0.8, because our system is expected to yield an 80% correct rate. Table 4-4 gives some examples. Note that a candidate whose surrounding bag-of-words has a high non-confidence should be removed.

4 bag-of-word       Non-Conf.   Neg. Inst.   Occurrence
of the the          0.91        10           11
in of region the    0.89        8            9
, , cells in        0.80        4            5
. site the to       0.80        4            5

Table 4-4: Examples of the 4 bag-of-word collected from the surroundings of protein entities.
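A sketch of the step 5 filter follows. It assumes that the bag-of-words is stored as the sorted tuple of the four context tokens, which is how the rows of Table 4-4 appear to be ordered, and it uses a strict ‘greater than’ comparison as stated in the text; since Table 4-4 also lists rows with a non-confidence of exactly 0.80, the original comparison may have been inclusive.

```python
def bag_key(tokens, start, end):
    """Sorted bag of up to two tokens before and after tokens[start:end]."""
    context = tokens[max(0, start - 2):start] + tokens[end:end + 2]
    return tuple(sorted(context))


def should_remove(key, stats, threshold=0.8):
    """True if the bag's non-confidence exceeds the threshold.

    stats maps a bag key to (negative_instances, occurrence).
    """
    if key not in stats:
        return False
    negative, occurrence = stats[key]
    return negative / occurrence > threshold


tokens = "binding of the IL-2 receptor in cells".split()
key = bag_key(tokens, 3, 4)          # context around the candidate 'IL-2'
stats = {("of", "the", "the"): (10, 11)}   # first row of Table 4-4
```

Using a sorted tuple makes the key independent of token order, which is what distinguishes this bag-of-words filter from the positional patterns of step 4.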

Step 6: employ syntactic rules

A hypernym may appear in front of its hyponyms [Hearst, ‘92], and the most common pattern is ‘NP₀ such as {NP₁, NP₂, … , (and|or) } NP_n’. We focus on ‘such as’ and ‘e.g.’ and on their preceding tokens, which provide important clues; consider, for example, ‘… proteins, such as CBL and VAV, were phosphorylated on …’. First, we search for ‘such as’, and then the preceding token ‘proteins’ tells us that what follows is a list of protein names.

Therefore, we can identify the protein names ‘CBL’ and ‘VAV’. In addition to the clue token ‘proteins’, we learn a list of preceding tokens that appear as clue tokens in the training set (Table 4-5).

activated, activation, cytokines, effectors, enzymes, eosinophils, factors, isoforms, kinases, lymphocytes, lymphokines, mediators, molecule, molecules, oncoproteins, proteases, protein, proteins, receptor, receptors, stimulus, transcription factors

Table 4-5: The list of significant clue tokens.
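The ‘such as’ rule of step 6 can be sketched as follows. The sketch handles only single-token names and uses a crude, assumed test for what looks like a name (any uppercase letter or digit); only the single-token clues of Table 4-5 are used, and ‘e.g.’ is left out for brevity.

```python
CLUES = {
    "activated", "activation", "cytokines", "effectors", "enzymes",
    "eosinophils", "factors", "isoforms", "kinases", "lymphocytes",
    "lymphokines", "mediators", "molecule", "molecules", "oncoproteins",
    "proteases", "protein", "proteins", "receptor", "receptors", "stimulus",
}
SEPARATORS = {",", "and", "or"}


def looks_like_name(tok):
    # Assumption: a name contains an uppercase letter or a digit.
    return any(ch.isupper() or ch.isdigit() for ch in tok)


def extract_such_as(tokens):
    """Collect names listed after '<clue> [,] such as'."""
    names = []
    for i in range(len(tokens) - 1):
        if tokens[i] != "such" or tokens[i + 1] != "as":
            continue
        j = i - 1
        if j >= 0 and tokens[j] == ",":
            j -= 1                      # skip the comma before 'such as'
        if j < 0 or tokens[j] not in CLUES:
            continue
        k = i + 2
        while k < len(tokens):
            if looks_like_name(tokens[k]):
                names.append(tokens[k])
                k += 1
            elif (tokens[k] in SEPARATORS and k + 1 < len(tokens)
                  and (looks_like_name(tokens[k + 1])
                       or tokens[k + 1] in SEPARATORS)):
                k += 1                  # separator inside the list
            else:
                break                   # end of the enumeration
    return names


sentence = ["proteins", ",", "such", "as", "CBL", "and", "VAV", ",",
            "were", "phosphorylated", "on"]
found = extract_such_as(sentence)
```

On the example sentence the sketch returns ‘CBL’ and ‘VAV’ and stops at the comma before ‘were’, because no name-like token follows it.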

The model performance is evaluated in terms of precision (P), recall (R) and F-score (F), where F = 2PR/(R+P). To present the performance of rule-based systems, we use the notations of correct matching defined in [Olsson et al., ‘02]:

Sloppy: Any proposed token matches some token of the answer key. For example, ‘CD28’ vs. ‘CD28 surface receptor’.

Protein Name Parts (PNP): Each proposed token matches any token of the answer key. For example, ‘activation of the CD28 surface receptor’ vs. ‘CD surface receptor’.

Strict: The proposed hit matches one answer key exactly. For example, ‘IL-2’ vs. ‘IL-2’.

Boundary:

Left: The leftmost proposed token matches a left boundary in the answer key. For example, ‘CD28’ vs. ‘CD28 surface receptor’.

Right: The rightmost proposed token matches a right boundary in the answer key. For example, ‘activation of the CD28 surface receptor’ vs. ‘CD28 surface receptor’.

Left or Right (LorR): One of the boundaries matches the corresponding boundary of the answer key. This notation is the union of Left and Right.
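The span-level notations can be sketched as a single comparison of token offsets. The sketch assumes that a proposed name and an answer key are given as (start, end) token offsets in the same sentence, with an exclusive end; PNP is counted per token rather than per span and is therefore left out.

```python
def match_types(proposed, answer):
    """Compare two (start, end) token spans, end exclusive."""
    (p_start, p_end), (a_start, a_end) = proposed, answer
    left = p_start == a_start
    right = p_end == a_end
    return {
        "sloppy": p_start < a_end and a_start < p_end,  # spans overlap
        "strict": left and right,
        "left": left,
        "right": right,
        "LorR": left or right,
    }


# Sentence: activation of the CD28 surface receptor
answer = (3, 6)                      # 'CD28 surface receptor'
cd28 = match_types((3, 4), answer)   # proposed 'CD28'
whole = match_types((0, 6), answer)  # proposed whole phrase
```

Here ‘CD28’ is a sloppy and left match but not a strict or right match, while the whole phrase is a sloppy and right match, mirroring the examples given for each notation.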

Table 4-6 shows that the strict measure yields F-scores of 51.82% in SRC and 50.70% in GENIA. It also shows that the terms, which come from SRC, are adaptable, because the performance in SRC and GENIA is almost the same. Table 4-7 shows that the improvement after steps 1 to 3 is obvious, whereas steps 4 to 6 have only a small effect. Across the steps, precision is boosted markedly, while recall changes little.

SRC
Notation   tp + fn   tp + fp   tp      Recall   Precision   F-Score
SLOPPY     3,234     4,782     2,987   92.36%   62.46%      74.53%
PNP        3,234     4,782     2,859   88.40%   59.79%      71.33%
STRICT     3,234     4,782     2,077   64.22%   43.43%      51.82%
LEFT       3,234     4,782     2,620   81.01%   54.79%      65.37%
RIGHT      3,234     4,782     2,363   73.07%   49.41%      58.96%
LorR       3,234     4,782     2,907   89.89%   60.79%      72.53%

GENIA
Notation   tp + fn   tp + fp   tp      Recall   Precision   F-Score
SLOPPY     3,451     4,923     3,010   87.22%   61.14%      71.89%
PNP        3,451     4,923     2,837   82.21%   57.63%      67.76%
STRICT     3,451     4,923     2,123   61.52%   43.12%      50.70%
LEFT       3,451     4,923     2,765   80.12%   56.16%      66.04%
RIGHT      3,451     4,923     2,296   66.53%   46.64%      54.84%
LorR       3,451     4,923     2,938   85.13%   59.68%      70.17%

Table 4-6: The rule-based result in SRC and GENIA.
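As a worked check of how the three measures relate to the counts in Table 4-6, the following sketch recomputes the STRICT row for SRC from tp, tp + fn and tp + fp. The function name is hypothetical.

```python
def prf(tp, tp_fn, tp_fp):
    """Recall, precision and F-score from the counts used in the tables."""
    recall = tp / tp_fn
    precision = tp / tp_fp
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


# STRICT row for SRC in Table 4-6: tp = 2,077, tp + fn = 3,234, tp + fp = 4,782
recall, precision, f_score = prf(2077, 3234, 4782)
percentages = tuple(round(100 * x, 2) for x in (recall, precision, f_score))
```

The result reproduces the 64.22% recall, 43.43% precision and 51.82% F-score reported for that row.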

SRC
Procedure   tp + fn   tp + fp   tp      Recall   Precision   F-Score
Step 1      3,234     10,480    2,051   63.42%   19.57%      29.91%
Step 2      3,234     5,493     2,043   63.17%   37.19%      46.82%
Step 3      3,234     4,911     2,040   63.08%   41.54%      50.09%
Step 4      3,234     4,977     2,104   65.06%   42.27%      51.25%
Step 5      3,234     4,781     2,077   64.22%   43.44%      51.83%
Step 6      3,234     4,782     2,077   64.22%   43.43%      51.82%

GENIA
Procedure   tp + fn   tp + fp   tp      Recall   Precision   F-Score
Step 1      3,451     7,911     2,160   62.59%   27.30%      38.02%
Step 2      3,451     5,173     2,129   61.69%   41.16%      49.37%
Step 3      3,451     5,082     2,127   61.63%   41.85%      49.85%
Step 4      3,451     5,164     2,155   62.45%   41.73%      50.03%
Step 5      3,451     4,915     2,120   61.43%   43.13%      50.68%
Step 6      3,451     4,923     2,123   61.52%   43.12%      50.70%

Table 4-7: The intermediate results of rule-based approach.

4.2. Statistical and Hybrid Named Entity Extraction

The statistical approach is based on HMM. Three models, the traditional model, the mutual information model and the concise model, are examined, and a back-off model is also presented. In the hybrid extraction system, we add the result of rule-based named entity extraction as a feature, and this feature boosts the F-score by about 1%.

4.2.1. Features Extraction

Internal features are surface clues within tokens (e.g. the initial character is upper case), external features are external information associated with tokens (e.g. POS tags), and global features are significant information from the training set (e.g. ‘protein’ indicates that a chunk is a protein entity).

Internal features are associated with the characteristics or surface form of tokens, such as an initial uppercase letter or all uppercase letters. These features are shown in Table 4-8, and we use the conjunction of these features. For example, the features INIT_UPPER, SUFFIX_NUM, LETTER_DIGITAL and CONTAIN_HYPHEN are assigned to ‘BK-2’.

Moreover, in the HMM we consider the features of not only the current token but also the preceding token.

NO   Feature Name          Description                                            Example
1    INIT_UPPER            The initial character is upper case.                   BK-2
2    INIT_LOWER            The initial character is lower case.                   c-551
3    INIT_NUM              The initial character is a number.                     5-HT1B
4    INIT_SYMBOL           The initial character is a symbol.                     -p1
5    SUFFIX_NUM            The suffix is a number.                                MDBP-2-H1
6    CONTAIN_GREEK         The token contains a Greek letter.                     3beta-hydroxysteroid
7    LETTER_DIGITAL        There are letters before a number.                     A43
8    TWO_CAPS              There are more than two capital letters.               RasHua
9    ALL_UPPER             All characters are upper case.                         ALP
10   ALL_LOWER             All characters are lower case.                         bombesin
11   NUM                   The token is a number.                                 35 kDa protein
12   OTHER_SINGLE_SYMBOL   It is a symbol, but not one of "- [ ] : ; % ( ) , ."   '
13   CONTAIN_HYPHEN        The token contains a hyphen.                           5-HT1B
14   SINGLE_UPPER          The token is a single uppercase character.             A protein
15   CONTAIN_SLASH         The token contains a slash.                            C/EBP

Table 4-8: The internal features, and their descriptions and examples.
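A few of the internal features in Table 4-8 can be sketched as simple string tests. The exact definitions used in the original system are not given, so each test below is an assumed reading of the table's description; TWO_CAPS, CONTAIN_GREEK and OTHER_SINGLE_SYMBOL are omitted, and the token is assumed to be non-empty.

```python
import re


def internal_features(token):
    """Return the set of (assumed) internal features of a non-empty token."""
    feats = set()
    first = token[0]
    if first.isupper():
        feats.add("INIT_UPPER")
    elif first.islower():
        feats.add("INIT_LOWER")
    elif first.isdigit():
        feats.add("INIT_NUM")
    else:
        feats.add("INIT_SYMBOL")
    if token[-1].isdigit():
        feats.add("SUFFIX_NUM")
    if re.search(r"[A-Za-z].*\d", token):     # a letter before a digit
        feats.add("LETTER_DIGITAL")
    if token.isalpha() and token.isupper():   # letters only, all upper case
        feats.add("ALL_UPPER")
    if token.isalpha() and token.islower():   # letters only, all lower case
        feats.add("ALL_LOWER")
    if token.isdigit():
        feats.add("NUM")
    if "-" in token:
        feats.add("CONTAIN_HYPHEN")
    if "/" in token:
        feats.add("CONTAIN_SLASH")
    if len(token) == 1 and token.isupper():
        feats.add("SINGLE_UPPER")
    return feats


bk2 = internal_features("BK-2")
```

With these definitions ‘BK-2’ receives exactly the four features named in the text: INIT_UPPER, SUFFIX_NUM, LETTER_DIGITAL and CONTAIN_HYPHEN.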

In addition, we consider prefix and suffix strings, because they benefit the performance in our studies. We take the 1,000 most frequent three-character prefix and suffix strings, and Table 4-9 shows the top 20.

Internal Features   Example
Prefix              pro, tha, gen, seq, wit, con, fro, res, ami, aci, com, str, pre, the, sub, act, thi, exp, alp, tra
Suffix              ion, ing, ase, ted, ein, hat, nce, ith, ent, rom, ine, ate, ity, ene, ide, nal, ins, ons, ino, ain

Table 4-9: The top 20 examples of prefix and suffix strings.

External features are features that are not extracted from the components of the entities themselves, such as POS tags and the BIO tags produced by the rule-based approach. Our classifier can locate protein entities according to POS tags, because the tokens of protein entities are normally tagged as nouns. Similarly, the output of the rule-based approach is associated with protein entities.

Token   Functions   of   cyclin   A1   in   the   cell   cycle   and   its    interactions
POS     NNS         IN   NN       NN   IN   DT    NN     NN      CC    PRP$   NNS
R_BIO   O           O    B        I    O    O     O      O       O     O      O

Token   with   transcription   factor   E2F-1   and   the   Rb   family   of   proteins   .
POS     IN     NN              NN       NN      CC    DT    NN   NN       IN   NNS        .
R_BIO   O      B               I        I       O     O     O    O        O    O          O

Table 4-10: A sentence and its corresponding external features. (‘R_BIO’ is the result of rule-based named entity extraction.)

Global features are features extracted from the whole training corpus by a statistical method such as chi-square. The chi-square test is a technique for testing hypotheses about differences. The essence of the test is to compare the observed frequencies with the frequencies expected under independence. The resulting features are usually the significant terms that discriminate the target entities.

To select significant nouns, we use chi-square to score each token. The simple form of the 2-by-2 chi-square test is as follows:

$$\chi^2 = \frac{N\,(O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}$$

where $N = O_{11} + O_{12} + O_{21} + O_{22}$, $O_{11}$ is the number of occurrences of the specific token in protein names, $O_{12}$ is the number of other tokens in protein names, $O_{21}$ is the number of occurrences of the specific token not in protein names, and $O_{22}$ is the number of other tokens not in protein names.

We run a complete-link clustering algorithm to group the top 500 nouns, with a window size of three sentences. This reduces the dimensions to 214 and 142 clusters in SRC and GENIA respectively.

Global Features   Example
Nouns             factor, NF-kappa, B, protein, receptor, NF-kappaB, IL-2, alpha, factors, transcription, proteins, kinase, receptors, AP-1, kappa, IL-4, I, cytokines, TNF-alpha, domain

Table 4-11: The significant nouns according to chi-square estimation.
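The chi-square score used to rank nouns can be sketched with the standard closed form for a 2-by-2 contingency table. The argument names follow the cell definitions in the text (specific token versus other tokens, inside versus outside protein names); the function name and the zero-denominator convention are assumptions.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Chi-square statistic of a 2-by-2 contingency table.

    o11: specific token in protein names
    o12: other tokens in protein names
    o21: specific token not in protein names
    o22: other tokens not in protein names
    """
    n = o11 + o12 + o21 + o22
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


independent = chi_square_2x2(10, 10, 10, 10)
associated = chi_square_2x2(10, 0, 0, 10)
```

A token distributed evenly inside and outside protein names scores 0, while a token that occurs only inside protein names scores high, which is why high-scoring nouns such as ‘factor’ or ‘receptor’ are useful as global features.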

4.2.2. HMM Modeling

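A) Traditional HMM:

We apply Bayes’s rule:

$$\Pr(S_1^n \mid T_1^n) = \frac{\Pr(T_1^n \mid S_1^n)\,\Pr(S_1^n)}{\Pr(T_1^n)} \tag{4-3}$$

We assume conditional probability independence and consider the preceding state:

$$\Pr(T_1^n \mid S_1^n) = \prod_{i=1}^{n} \Pr(t_i \mid s_{i-1}, s_i) \tag{4-4}$$

and the equation (4-2) can be rewritten:

$$\arg\max_{S_1^n} \Pr(S_1^n \mid T_1^n) = \arg\max_{S_1^n} \Bigl( \Pr(S_1^n) \prod_{i=1}^{n} \Pr(t_i \mid s_{i-1}, s_i) \Bigr) \tag{4-5}$$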



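B) Mutual Information HMM:

A mutual information HMM was presented in [Zhou and Su, ‘02], where the F-scores are 96.94% and 94.28% in MUC-6 and MUC-7 respectively. Different from the traditional HMM, the goal is to maximize the equation:

$$\log \Pr(S_1^n \mid T_1^n) = \log \Pr(S_1^n) + \log \frac{\Pr(S_1^n, T_1^n)}{\Pr(S_1^n)\,\Pr(T_1^n)} \tag{4-6}$$

In order to simplify the computation, mutual information independence is assumed:

$$MI(S_1^n, T_1^n) = \sum_{i=1}^{n} MI(s_i, T_1^n) \tag{4-7}$$

where $MI(X, Y) = \log \frac{\Pr(X, Y)}{\Pr(X)\,\Pr(Y)}$. Applying it to equation (4-6), we have:

$$\log \Pr(S_1^n \mid T_1^n) = \log \Pr(S_1^n) - \sum_{i=1}^{n} \log \Pr(s_i) + \sum_{i=1}^{n} \log \Pr(s_i \mid T_1^n) \tag{4-8}$$

C) Concise HMM:

The concise HMM is based on the idea of maximizing only the fundamental term of equation (4-8). The remaining terms have no significant meaning, because the weak probabilities of state transitions and states are merely 3-by-3 and 3-by-1 matrices respectively. Thus, the concise HMM is to maximize the equation:

$$\sum_{i=1}^{n} \log \Pr(s_i \mid T_1^n)$$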

The concise HMM is combined with a back-off model. This is because the concise HMM does not consider the HMM’s state transitions, and a back-off model is a relaxed probability model whose levels are applied in decreasing order of precision. Another issue is the state transition probability, which is the probability of one state changing into another; we therefore put the previous state into the model to ensure correct state induction.

4.2.3. Back-off Modeling

Since our system is based on an HMM with many features, it is possible to train a high-accuracy probability model. However, the training data cannot cover all cases, so the data sparseness problem arises. To overcome this problem, we use a back-off model, which targets the token sequence $T_1^n$ in $\Pr(S_1^n \mid T_1^n)$ or $\Pr(s_i \mid T_1^n)$. $T_1^n$ represents not only a token sequence but also its internal, external and global features. We then define two back-off levels:

A) The first level is based on different combinations of tokens and their features, and $T_1^n$ is assigned in the following descending order:

1. <s_{-1}, t_{-1}, t_0, f_0>

2. <s_{-1}, t_0, f_0>

3. <s_{-1}, t_{-1}, f_0>

4. <s_{-1}, f_0>

where ‘f_i’ represents the internal, external and global features, ‘t_i’ is a token, ‘s_i’ is an HMM state, and ‘i’ is the position relative to the current token.

B) The second level is based on different combinations of features, and ‘f_i’ in the first level is assigned in the following descending order:

1. <f_i^I, f_i^E, f_i^G>

2. <f_i^I, f_i^E>

3. <f_i^I>

where f_i^I, f_i^E and f_i^G represent the internal, external and global features respectively.
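The two back-off levels can be sketched as an ordered sequence of lookup keys. The nesting order is an assumption: for each first-level template, this sketch tries the second-level feature combinations from most to least specific. The model is assumed to be a plain dictionary from keys to state distributions, and all names are hypothetical.

```python
LEVEL1 = [
    ("s-1", "t-1", "t0", "f0"),
    ("s-1", "t0", "f0"),
    ("s-1", "t-1", "f0"),
    ("s-1", "f0"),
]
LEVEL2 = [("I", "E", "G"), ("I", "E"), ("I",)]


def backoff_keys(prev_state, prev_token, token, feats):
    """Yield lookup keys from the most to the least specific context.

    feats maps 'I', 'E' and 'G' to the (hashable) internal, external and
    global features of the current token.
    """
    fields = {"s-1": prev_state, "t-1": prev_token, "t0": token}
    for template in LEVEL1:
        context = tuple(fields[name] for name in template if name != "f0")
        for kinds in LEVEL2:
            f0 = tuple(feats[k] for k in kinds)
            # The template and feature kinds are part of the key so that
            # different back-off levels never collide.
            yield (template, kinds, context, f0)


def lookup(model, prev_state, prev_token, token, feats):
    """Return the first distribution found along the back-off order."""
    for key in backoff_keys(prev_state, prev_token, token, feats):
        if key in model:
            return model[key]
    return None


feats = {"I": ("INIT_UPPER", "SUFFIX_NUM"), "E": "NN", "G": "factor"}
keys = list(backoff_keys("I", "factor", "E2F-1", feats))
model = {keys[-1]: {"B": 0.1, "I": 0.7, "O": 0.2}}
```

Because the keys are generated in a fixed order, a context seen in training at a specific level is always preferred, and the model falls back to `<s_{-1}, f_0>` with internal features only when nothing more specific is available.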

We implemented the traditional, mutual information and concise HMMs. We use the same back-off models within the concise and mutual information HMMs, but not within the traditional HMM. Table 4-12 shows that the concise HMM with rule-based features (i.e. the Concise-Hybrid HMM) yields the best result. The traditional HMM also obtains a good F-score, but its recall is not good enough. The reason is that we choose a strict probability model to get the best F-score. The performance of the mutual information HMM is the worst, because the back-off model is optimized for the concise HMM.

SRC
HMM                tp + fn   tp + fp   tp      Recall   Precision   F-Score
Concise            3,234     2,953     2,355   72.82%   79.75%      76.13%
Concise - Hybrid   3,234     2,949     2,391   73.93%   81.08%      77.34%
MI                 3,234     3,439     2,384   73.72%   69.32%      71.45%
Traditional        3,234     2,396     2,086   64.50%   87.06%      74.10%

GENIA
HMM                tp + fn   tp + fp   tp      Recall   Precision   F-Score
Concise            3,451     3,285     2,553   73.98%   77.72%      75.80%
Concise - Hybrid   3,451     3,323     2,596   75.22%   78.12%      76.65%
MI                 3,451     3,415     2,305   66.79%   67.50%      67.14%
Traditional        3,451     2,863     2,263   65.58%   79.04%      71.68%

Table 4-12: The comparison between HMM models.

Table 4-13 shows that every feature has a positive effect (f^E > f^I > f^G) in the concise HMM, because the F-score drops whenever any feature is removed. Moreover, the concise HMM relies mainly on the back-off model, because each feature has only a slight influence on the F-score.

SRC
Features    tp + fn   tp + fp   tp     Recall   Precision   F-Score   Diff.
All         3234      2949      2391   73.93%   81.08%      77.34%
All - f^G   3234      2948      2372   73.35%   80.46%      76.74%    -0.60%
All - f^E   3234      2888      2319   71.71%   80.30%      75.76%    -1.58%
All - f^I   3234      2943      2342   72.42%   79.58%      75.83%    -1.51%

GENIA
Features    tp + fn   tp + fp   tp     Recall   Precision   F-Score   Diff.
All         3451      3323      2596   75.22%   78.12%      76.65%
All - f^G   3451      3304      2576   74.65%   77.97%      76.27%    -0.38%
All - f^E   3451      3231      2483   71.95%   76.85%      74.32%    -2.33%
All - f^I   3451      3250      2511   72.76%   77.26%      74.94%    -1.70%

Table 4-13: The effects of features in concise HMM.

4.3. Systems Comparison

4.3.1. Experiment with SRC

In SRC, we compare our systems with KeX and Yapex. Not only our rule-based approach but also our statistical and hybrid approaches perform better than these two systems. Table 4-14 shows the results as F-scores. KeX and Yapex achieve fairly good results under the PNP notation but much lower ones under the strict notation, because rule-based systems have little ability to identify correct boundaries.

                              Our System
Notation   KeX      Yapex    Rule-based   Statistical   Hybrid
Sloppy     54.85%   60.78%   74.53%       89.12%        89.79%
PNP        48.21%   52.49%   71.33%       80.78%        81.97%
Strict     19.07%   32.37%   51.82%       76.13%        77.34%
Left       32.29%   40.78%   65.37%       85.34%        86.14%
Right      29.53%   47.78%   58.96%       78.36%        79.51%
LorR       42.75%   56.17%   72.53%       88.06%        88.86%

Table 4-14: Comparison between rule-based systems in SRC.

4.3.2. Experiment with GENIA Corpus

We run our systems on GENIA version 1.1 and on the latest version, 3.02p, published on 19 Aug. 2003. GENIA version 3.02p is based on version 3.0 with errors fixed.

Therefore, we compare with the systems developed by Lee et al. [‘03] and Shen et al. [‘03], which use GENIA version 3.0 or later. Table 4-15 shows that our hybrid approach yields the best F-score under the strict notation. Besides, we obtain a good precision owing to the incorporation of strict probability models and back-off models. As for the rule-based approaches, KeX and Yapex yield F-scores of 72.32% and 65.88% respectively under the PNP notation, far above their strict scores, because it is difficult for rule-based approaches to identify boundaries.

System                    Method       GENIA Ver.   Recall   Precision   F-score
(Lee, 2003)               SVM          3.0p         78.80%   61.70%      69.20%
(Shen, 2003)              HMM          3.0          -        -           70.81%
KeX                       Rule-based   3.02p        43.67%   37.40%      40.29%
Yapex                     Rule-based   3.02p        45.06%   50.17%      47.48%
Our System (Rule-based)   Rule-based   3.02p        61.52%   43.12%      50.70%
Our System (HMM)          HMM          3.02p        73.98%   77.72%      75.80%
Our System (Hybrid)       Hybrid       3.02p        75.22%   78.12%      76.64%

Table 4-15: Comparison with other systems in GENIA version 3.x.

There are 671 abstracts in GENIA version 1.1, and 80 abstracts are selected as the testing set [Kazama et al., ’02; ‘03]. We compare with Kazama’s systems and obtain a better result. Our performance on GENIA version 3.02p is better than that on version 1.1, because the training set in version 3.02p is larger than that in version 1.1.

System Method Recall Precision F-score
