UNCORRECTED
PROOF
Discovering gene–gene relations from sequential sentence patterns in biomedical literature
Jung-Hsien Chiang a,*, Hsiao-Sheng Liu b, Shih-Yi Chao a, Cheng-Yu Chen a
a Department of Computer Science and Information Engineering, National Cheng Kung University, 1 Da-Shuei Road, Tainan 701, Taiwan
b Department of Microbiology and Immunology, College of Medicine, National Cheng Kung University, Tainan, Taiwan
Abstract
We have developed a gene–gene relation browser (DiGG) that integrates sequential pattern mining with an information-extraction model to extract knowledge on gene–gene interactions from the biomedical literature. DiGG uses an efficient mining technique that discovers frequent gene–gene sequences even in very long sentences. Our approach aims to detect associated gene relations that are often discussed in documents. Integrating the related relations yields a relation network for each individual gene, and a graphical presentation shows the relationships between gene products. A salient feature of this approach is that it incrementally outputs new frequent gene relations in an online visualization.
© 2006 Published by Elsevier Ltd.
Keywords: Text mining; Bioinformatics; Sequential pattern mining; Information extraction; Gene networks
1. Introduction
Now that the Human Genome Project has finished accumulating the sequences of human genes, the most challenging research has begun. The next step in genome analysis requires not only defining the function of each gene, but also determining the role of its interactions with other genes. In particular, the study of gene–gene interactions forms the basis for understanding phenomena such as activation, inhibition, down-regulation, and up-regulation. Gene–gene interaction resources have been collected in databases such as MIPS, EcoCyc, and KEGG, but most interactions are still not cataloged: information about them exists only in the scientific literature, which is written in natural language that computers cannot easily understand. Efficient processing of large amounts of text to obtain this biological knowledge therefore requires sophisticated information-extraction methods.
A number of methods have been proposed to generate patterns for information extraction from biomedical documents (Marcotte, Xenarios, & Eisenberg, 2001; Ono, 2001), for example, hand-coded pattern sets and statistical measures of keywords. Hand-coded pattern sets are based on significant interaction verbs and gene names, for example, [Protein A interacts with Protein B]. Such patterns yield fairly high precision but low recall, because there are many ways to express biological knowledge in natural language. Manually generated patterns are also unreliable because there are many possible linkages between gene terms. Other methods are based on statistical measures of the co-occurrence of keywords or gene names. This approach achieves high recall but low precision because it assumes that any pair of genes encountered in the same sentence interact, which is not always true. Many false positives are thus retrieved, because significant interaction keywords and gene names may occur in the same sentence even when the genes mentioned are not syntactically related.
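To make this trade-off concrete, the following minimal Python sketch contrasts the two families of methods on two sentences. The gene lexicon, the single hand-coded pattern, and all function names are assumptions made for illustration; they are not taken from any of the cited systems.

```python
import re
from itertools import combinations

# Hypothetical gene lexicon, for illustration only.
GENES = {"MMP-9", "NAC", "TPA"}

# One hand-coded pattern of the form "[A interacts with B]".
HAND_CODED = re.compile(r"([A-Za-z0-9\-]+) interacts with ([A-Za-z0-9\-]+)")


def find_genes(sentence):
    """Return lexicon entries in their order of appearance."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    return [t for t in tokens if t in GENES]


def cooccurrence_pairs(sentence):
    """Co-occurrence: every pair of genes in a sentence is assumed to interact."""
    return set(combinations(find_genes(sentence), 2))


def handcoded_pairs(sentence):
    """Hand-coded pattern: only the literal phrasing is accepted."""
    return {(a, b) for a, b in HAND_CODED.findall(sentence)
            if a in GENES and b in GENES}


s1 = "MMP-9 interacts with NAC."
s2 = "MMP-9 was directly inhibited by NAC but was not influenced by TPA."

print(handcoded_pairs(s1))     # the pattern fires on its own phrasing
print(handcoded_pairs(s2))     # empty: a true relation is missed (low recall)
print(cooccurrence_pairs(s2))  # includes unrelated pairs (low precision)
```

On the second sentence the hand-coded pattern finds nothing, while co-occurrence also returns pairs such as (NAC, TPA), for which the sentence states no relation.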
0957-4174/$ - see front matter 2006 Published by Elsevier Ltd.
* Corresponding author. Tel.: +886 6 275 7575x62534; fax: +886 6 274 7076.
E-mail address: [email protected] (J.-H. Chiang).
www.elsevier.com/locate/eswa Expert Systems with Applications xxx (2006) xxx–xxx
[…] documents of sentences that describe gene–gene relations:
1. “In vitro experiments demonstrated that MMP-9 was directly inhibited by NAC but was not influenced by TPA.” (Anticancer Research, 21(1A), 213–219)
2. “At the same time, PMA induced hyperphosphorylation of MARCKS and talin.” (International Journal of Cancer, 75(5), 774–779)
3. “Complex formation with the MDM2 oncogene product is one mechanism inactivating the p53 protein.” (European Urology, 32(4), 487–493)
4. “Balance between activated-STAT and MAP kinase regulates the growth of human bladder cell lines after treatment with epidermal growth factor.” (International Journal of Oncology, 15(4), 661–667)
It can be seen from the above examples that the syntactic relationships between words can be positive or negative. A positive syntactic relationship (e.g. induce, inhibit, inactivate, regulate) characterizes the G–G relation in a sentence, while a negative one (e.g. not, but, and, nor) signals no relation or even a reversed relation. A syntactic relationship must be positive in order to determine what sort of G–G relation exists. Moreover, an active (or passive) description also expresses an ordered sequence. These sequences represent true biological relations between gene products. In this study, we use a sequential-pattern-mining algorithm to identify interaction patterns between genes. Specifically, we propose a sequential mining-based hybrid model to mine meaningful information-extraction rules that delineate the kinds of morphological features that can appear before and after the gene names in sentences describing gene–gene interactions in documents. This interaction identification traditionally demands heavy resources and often includes extensive cross-referencing.
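The distinction between positive and negative syntactic relationships, and the effect of passive voice on the order of the genes, can be sketched as follows. This is a deliberately naive illustration and not the DiGG rule set: the cue lists are the examples given above, and the passive-voice heuristic and all function names are assumptions.

```python
# Cue words taken from the examples in the text; stems are matched as substrings.
POSITIVE_STEMS = ("induc", "inhibit", "inactivat", "regulat")
NEGATIVE_CUES = {"not", "but", "and", "nor"}
AUXILIARIES = {"is", "are", "was", "were", "been"}


def classify_span(tokens):
    """Classify the words that lie between two gene mentions."""
    words = [t.lower() for t in tokens]
    if any(w in NEGATIVE_CUES for w in words):
        return "negative"
    if any(stem in w for w in words for stem in POSITIVE_STEMS):
        return "positive"
    return "none"


def is_passive(tokens):
    """Assumed heuristic: an auxiliary verb plus 'by' marks a passive clause."""
    words = [t.lower() for t in tokens]
    return "by" in words and any(w in AUXILIARIES for w in words)


def orient(first_gene, second_gene, tokens):
    """Return (agent, target); a passive clause reverses the surface order."""
    if is_passive(tokens):
        return second_gene, first_gene
    return first_gene, second_gene


# Sentence 1: "MMP-9 was directly inhibited by NAC but was not influenced by TPA."
mmp9_to_nac = ["was", "directly", "inhibited", "by"]
mmp9_to_tpa = ["was", "directly", "inhibited", "by", "NAC",
               "but", "was", "not", "influenced", "by"]

print(classify_span(mmp9_to_nac), orient("MMP-9", "NAC", mmp9_to_nac))
print(classify_span(mmp9_to_tpa))
```

A word-level test like this is too coarse for real text: in sentence 2 the conjunction "and" joins two genuine targets of PMA, so a practical system would need to scope the cues to clauses.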
2. Methods
2.1. System architecture
Scientific literature carries much information. To make that information easily and efficiently accessible to researchers, the literature must be computer-readable and causality-interpretable. One way to achieve this is to first divide each document into its constituent sentences and then use a shallow parser to identify the part of speech (POS) of each word in each sentence. The parsing results can then be used as training samples for the subsequent sequential pattern-mining algorithm. The mining stage finds candidate frequent sequences within those sentences. The mining algorithm is especially efficient when the sequential gene/relation patterns in the database are complicated. Fig. 1 shows a schematic flow diagram of the proposed method, which consists of three components in the DiGG system: the preprocessing stage, the mining stage, and the interpretation stage. The approach can be summarized as follows:
1. tag the parts of speech and the gene names/relations,
2. extract gene–gene interaction rules,
3. mine all positive syntactic patterns from the training samples,
4. associate candidate sequences,
5. display the evidence of possible gene–gene relationships graphically.
In this section, we discuss the detailed procedures for the proposed framework and briefly describe the developed system, DiGG (Fig. 2).
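The last two steps of the summary, matching extracted patterns against tagged sentences and merging the resulting gene pairs into a network, might look like the following minimal sketch. The contiguous-match strategy, the data layout, and the function names are assumptions for illustration; the tagged sentence is the training sample discussed in Section 2.2.

```python
from collections import defaultdict


def match_pattern(tagged, pattern):
    """Return (gene, relation, gene) for the first contiguous tag match, else None."""
    tags = [tag for _, tag in tagged]
    n = len(pattern)
    for start in range(len(tags) - n + 1):
        if tags[start:start + n] == list(pattern):
            window = tagged[start:start + n]
            genes = [word for word, tag in window if tag == "Gene"]
            relations = [word for word, tag in window if tag == "Rlt"]
            if len(genes) == 2 and len(relations) == 1:
                return genes[0], relations[0], genes[1]
    return None


def merge_into_network(triples):
    """Merge (source, relation, target) triples into a directed adjacency map."""
    network = defaultdict(lambda: defaultdict(set))
    for source, relation, target in triples:
        network[source][target].add(relation)
    return network


# Tagged sentence in (word, tag) form and a hypothetical extraction rule.
sentence = [("IL-6", "Gene"), ("was", "vbd"), ("found", "vbn"),
            ("to", "to"), ("decrease", "Rlt"), ("mdr1", "Gene")]
rule = ("Gene", "vbd", "vbn", "to", "Rlt", "Gene")

triple = match_pattern(sentence, rule)
network = merge_into_network([triple, ("PMA", "up-regulate", "VEGF")])
print(triple)
print({src: dict(dst) for src, dst in network.items()})
```

Storing a set of relation words per directed edge keeps repeated evidence for the same gene pair in one place, which is convenient when the network is later drawn as a graph.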
2.2. Mining information extraction rules
In this study, we are interested in the biologically sequential relations between genes, not in the words used to describe those relations. We therefore need to divide sentences into several blocks based on stopwords, gene names, and relational terms. A valid sentence is transformed into time-sequential data read from left to right. Fig. 3 illustrates how a preprocessed training sample is divided into several blocks. Consider the following training-sample sentence:
“IL-6/Gene was/vbd found/vbn to/to decrease/Rlt mdr1/Gene”.
[Figure: schematic flow diagram of the proposed method (cf. Fig. 1). Recoverable labels: Training Phase, Test Phase, Text Corpus, Part-of-Speech Tagger, Gene and Interaction Tagging, Gene and Rlt List, Tagged Text, Format Transformation, Training Samples, Sequential Pattern Mining, LItemset Phase, Large Item-sets, 2-Sequences, All Possible Sequences, Sequential Patterns, Annotated Text, Pattern Matching, Gene Pairs, Merging Phase, Association Graph, Interpretation, G–G Network.]
[Figure: pre-processing and tokenization example. Steps shown: PubMed text corpus (internet); divide the corpus into sentences; part-of-speech tagging; apply the POS rules; identify gene names and relation words; extract the training samples. Example sentence: “The gpal mutant blocked stable association of Ste4p with the plasma membrane, and the stel8 mutant blocked stable association of Ste4p with both plasma membranes and internal membranes.” Its first clause is tagged as “The/DT gpal/NNP mutant/JJ blocked/VBN stable/JJ association/NN of/IN Ste4p/NNP with/IN the/DT plasma/NN membrane/NN” and reduced to the token sequence “Gene JJ Rlt JJ NN IN Gene”.]
Fig. 2. Pre-processing procedures.
Fig. 3. Reforming sentences into tokens so that they can be processed by sequential pattern-mining algorithms.
Fig. 4. Training sample (left) and its extracted patterns for describing gene–gene relations (right).
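The transformation of a POS-tagged sentence into blocks can be sketched in a few lines of Python. The two lexicons below contain only the entries needed for the IL-6 training sample and are placeholders, as are the function name and the input POS tags assumed for the gene and relation words.

```python
# Placeholder lexicons; a real system would load curated gene and relation lists.
GENE_LEXICON = {"IL-6", "mdr1"}
RELATION_DICT = {"decrease"}


def to_blocks(tagged_words):
    """Replace gene names by 'Gene' and relation words by 'Rlt'; keep other POS tags."""
    blocks = []
    for word, pos in tagged_words:
        if word in GENE_LEXICON:
            blocks.append("Gene")
        elif word.lower() in RELATION_DICT:
            blocks.append("Rlt")
        else:
            blocks.append(pos)
    return blocks


# Output of a POS tagger for "IL-6 was found to decrease mdr1" (tags assumed).
tagged = [("IL-6", "nnp"), ("was", "vbd"), ("found", "vbn"),
          ("to", "to"), ("decrease", "vb"), ("mdr1", "nnp")]

print(to_blocks(tagged))
```

The resulting list is the left-to-right sequence that is handed to the mining stage; the surface words are discarded once the gene and relation categories have been assigned.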
According to our gene lexicon, “IL-6” and “mdr1” are marked as gene names; based on a relation dictionary, “decrease” falls into the “relation (Rlt)” category. Words that belong to neither category retain their original part-of-speech designation, for example “vbd”, “vbn”, and “to” above. We then feed these training blocks into sequential pattern-mining algorithms (Agrawal & Srikant, 1995; Yen & Chen, 1996) to obtain G–G relation-extraction rules. Sequential pattern-mining algorithms not only discover large itemsets (groups of items that appear together) but also identify large sequences (ordered lists of itemsets). The ordered list comprises the patterns of gene–gene interactions and describes exactly which gene acts on which gene (or genes). Fig. 4 illustrates an example of extracted relation patterns, and Fig. 5 summarizes our sequential pattern-mining algorithm.
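As a rough illustration of the mining stage, the sketch below performs a level-wise (Apriori-style) search for frequent sequences over tokenized sentences. It is a simplification and not the algorithm of Fig. 5: each element of a sequence is a single token rather than an itemset, support is the number of sentences that contain the pattern as a (not necessarily contiguous) subsequence, and the function names, the toy database, and the thresholds are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def mine_sequences(database, min_support, max_len=4):
    """Grow frequent patterns one item at a time, pruning by minimum support."""
    items = sorted({item for seq in database for item in seq})
    level = {(i,): support((i,), database) for i in items}
    level = {p: s for p, s in level.items() if s >= min_support}
    frequent = {}
    while level:
        frequent.update(level)
        if len(next(iter(level))) >= max_len:
            break
        # Every frequent pattern has a frequent prefix, so extending the
        # current level by one item reaches all candidates of the next length.
        candidates = {p + (i,) for p in level for i in items}
        scored = {c: support(c, database) for c in candidates}
        level = {c: s for c, s in scored.items() if s >= min_support}
    return frequent


# Toy database of tokenized training sentences (illustrative only).
database = [
    ["Gene", "vbd", "vbn", "to", "Rlt", "Gene"],
    ["Gene", "JJ", "Rlt", "JJ", "NN", "IN", "Gene"],
    ["Gene", "Rlt", "Gene"],
]

patterns = mine_sequences(database, min_support=3)
for pattern, count in sorted(patterns.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(pattern, count)
```

With a minimum support of 3, the only frequent sequence of length three in this toy database is (Gene, Rlt, Gene), which corresponds to an ordered rule of the kind "one gene acts on another".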
3. Experimental results
We used the DiGG system to search for arsenic-induced bladder-cancer-related genes and their relations (Simeonova et al., 2000; Sanchez-Carbayo & Cordon-Cardo, 2003; Chiang, Yu, & Hsu, 2004). A total of 9870 corresponding abstracts were retrieved from the PubMed database (http://www.pubmedcentral.nih.gov/). The system then automatically identified human gene names in those abstracts to filter out non-relevant documents. We applied DiGG to these documents and found 48 sequential G–G relations in valid sentences. Thirty-nine of these G–G relation descriptions were confirmed as “correct” by medical experts from a collaborating bio-research laboratory. Examples of correct relation pairs include “PMA ⟨up-regulate⟩ VEGF”, “LPA ⟨induce⟩ ACTIN”, and “p53 ⟨activate⟩ WAF1”. Our system achieves precision and recall rates of 81.25% and 73%, respectively, for bladder-cancer-related genes (Table 1).
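The reported precision follows directly from the counts above (39 confirmed relations out of 48 extracted), and the averages in Table 1 are plain means over the three test sets. The short check below reproduces this arithmetic; the function names are illustrative, and the recall value cannot be recomputed here because the number of relations in the gold standard is not stated.

```python
def precision(confirmed, extracted):
    """Fraction of extracted relations that experts confirmed as correct."""
    return confirmed / extracted


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


bladder_precision = precision(39, 48)   # 39 of 48 extracted relations confirmed
avg_precision = mean([81.25, 76, 79])   # precision column of Table 1
avg_recall = mean([73, 75, 70])         # recall column of Table 1

print(f"{bladder_precision:.2%}", round(avg_precision, 2), round(avg_recall, 2))
```

The mean recall comes out at about 72.67%, which is consistent with the 72.6% listed in Table 1 if that value was truncated rather than rounded.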
According to Staab (2002), the precision of manual information-retrieval techniques applied to biomedical documents may be above 80%, but with a recall of less than 20%. The reasons for this are described above. The recall rate of the DiGG system is reasonable because of the system’s “fault tolerance” in extracting interaction rules. A limitation of our system is that it cannot handle the very long and broadly descriptive terms that have been popularized by the biomedical community; this requires more research. Two false-positive examples in which the DiGG system is unable to identify the gene–gene relations are shown below (Fig. 6).
• After incubation with 4-ABP, F-actin decreased and G-actin increased in both cytoplasm and nuclei of PC cells and cytoplasmic F-actin fibers were lost, but only cytoplasmic actin was altered in the BC cells.
• Utilizing ASO directed against the raf-1 gene, a central component of this proposed pathway, we were able to reverse the RR phenotype of human tumor cell lines having elevated HER-2 expression or a mutant form of Ha-ras, two genes upstream of raf-1 in signal transduction.
4. Conclusion
We have developed a system for extracting and visualizing gene relations from biomedical documents. We believe
Fig. 6. Graphic interface of the DiGG system. View #1 lists relations extracted from biomedical documents. When users click on one of those relations, they can read its original description in View #2. The G–G network in View #3 displays valid relations between gene products. When users move the cursor to a gene node in View #3, the system shows related information concerning its cellular component, molecular function and biological process.
Table 1
Test performance obtained from the DiGG system for bladder cancer, CBL, and HSP27
                  Precision (%)   Recall (%)
Bladder cancer    81.25           73
CBL               76              75
HSP27             79              70
Average           78.75           72.6
[…] of medical literature data for further analysis, (2) help researchers to understand the inner workings of biological mechanisms through gene–gene relations, and (3) offer insights into relationships of genes or proteins via visualization of gene networks. To make our system more useful for biomedical research and discovery, it is imperative to allow researchers to examine their experimental data, such as gene expression, through gene–gene relations. Therefore, as part of the enhancements to the system, we can correlate protein–protein interactions with gene expression data to identify unknown biological processes.
Acknowledgement
This research work was supported in part by Research Grant NSC91-2321-B-006-003 from the National Science Council of the Republic of China.
References
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of international conference of data engineering, Taiwan, March.
Chiang, J.-H., Yu, H.-C., & Hsu, H.-J. (2004). GIS: a biomedical text-mining system for gene information discovery. Bioinformatics, 20(1), 120–121.
Marcotte, E. M., Xenarios, I., & Eisenberg, D. (2001). Mining literature for protein–protein interactions. Bioinformatics, 17, 359–363.
Ono, T. (2001). Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics, 17, 155–161.
Sanchez-Carbayo, M., & Cordon-Cardo, C. (2003). Applications of array technology: identification of molecular targets in bladder cancer. British Journal of Cancer, 89, 2172–2177.
Simeonova, P. et al. (2000). Arsenic mediates cell proliferation and gene expression in the bladder epithelium: association with activating protein-1 transactivation. Cancer Research, 60, 3445–3453.
Staab, S. (2002). Mining information for functional genomics. IEEE Intelligent Systems, 17, 70–73.
Yen, S. J., & Chen, A. L. P. (1996). An efficient approach to discovering knowledge from large databases. In 4th International conference on parallel and distributed information systems (PDIS ’96) (pp. 8–18), December 18.