Discovering gene-gene relations from sequential sentence patterns in biomedical literature


UNCORRECTED

PROOF


Jung-Hsien Chiang a,*, Hsiao-Sheng Liu b, Shih-Yi Chao a, Cheng-Yu Chen a

a Department of Computer Science and Information Engineering, National Cheng Kung University, 1 Da-Shuei Road, Tainan 701, Taiwan
b Department of Microbiology and Immunology, College of Medicine, National Cheng Kung University, Tainan, Taiwan

Abstract

In this paper, we present a gene–gene relation browser (DiGG) that integrates sequential pattern mining with an information-extraction model to extract knowledge about gene–gene interactions from biomedical literature. DiGG uses an efficient mining technique that can discover frequent gene–gene sequences even in very long sentences. Our approach aims to detect associated gene relations that are often discussed in documents. Integrating the related relations yields a relation network for an individual gene, and a graphic presentation shows the relationships between gene products. A salient feature of this approach is that it incrementally outputs new frequent gene relations in an online visualization fashion.

© 2006 Published by Elsevier Ltd.

Keywords: Text mining; Bioinformatics; Sequential pattern mining; Information extraction; Gene networks

1. Introduction

Now that the Human Genome Project has accumulated the complete sequences of human genes, the most challenging research has begun. The next step in genome analysis requires not only defining the function of each gene, but also determining the role of its interactions with other genes. In particular, the study of gene–gene interactions forms the basis for understanding phenomena such as activation, inhibition, down-regulation, and up-regulation. Gene–gene interaction resources have been collected in databases such as MIPS, EcoCyc, and KEGG, but most interactions are still not cataloged: information about them exists only in the scientific literature, which is written in natural language that computers cannot easily understand. Efficient processing of large amounts of text to obtain this biological knowledge therefore requires sophisticated information extraction methods.

A number of methods have been proposed to generate patterns for information extraction from biomedical documents (Marcotte, Xenarios, & Eisenberg, 2001; Ono, 2001), for example, hand-coded pattern sets and statistical measures of keywords. Hand-coded pattern sets are based on significant interaction verbs and gene names, for example, [Protein A interacts with Protein B]. Such patterns yield fairly high precision but low recall, because there are many ways to express biological knowledge in natural language. Manually generated patterns are also unreliable because there are many possible linkages between gene terms. Other methods are based on statistical measures of the co-occurrence of keywords or gene names. This approach achieves high recall but low precision because it assumes that any pair of genes encountered in the same sentence interact, which is not always true. Many false positives are thus retrieved because significant interaction keywords and gene names may occur in the same sentence even when the genes mentioned are not syntactically related.
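The trade-off between the two families of methods can be made concrete with a small sketch. It is illustrative only and is not code from any of the cited systems: the name lexicon `GENES`, the single rigid template, and both function names are assumptions chosen for the example.

```python
import re

# Hypothetical toy lexicon of entity names; a real system would use a curated dictionary.
GENES = {"MMP-9", "NAC", "TPA", "MDM2", "p53"}

def _mentions(sentence):
    """Return the lexicon entries that occur as whole tokens in the sentence."""
    return sorted(
        g for g in GENES
        if re.search(r"(?<![\w-])" + re.escape(g) + r"(?![\w-])", sentence)
    )

def cooccurrence_pairs(sentence):
    """Co-occurrence baseline: every pair of names in one sentence is a candidate."""
    found = _mentions(sentence)
    return [(a, b) for i, a in enumerate(found) for b in found[i + 1:]]

def handcoded_pairs(sentence):
    """Hand-coded baseline: only the rigid template '<A> interacts with <B>' matches."""
    names = "|".join(re.escape(g) for g in sorted(GENES))
    return re.findall(rf"({names}) interacts with ({names})", sentence)

s = ("In vitro experiments demonstrated that MMP-9 was directly inhibited "
     "by NAC but was not influenced by TPA.")
print(handcoded_pairs(s))     # [] : the template misses the relation (low recall)
print(cooccurrence_pairs(s))  # includes ('MMP-9', 'TPA'), which the sentence denies
```

The template finds nothing in this sentence, while the co-occurrence baseline proposes the pair (MMP-9, TPA) even though the sentence states that no influence was observed.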

* Corresponding author. Tel.: +886 6 275 7575x62534; fax: +886 6 274 7076. E-mail address: [email protected] (J.-H. Chiang).

Expert Systems with Applications xxx (2006) xxx–xxx. www.elsevier.com/locate/eswa. 0957-4174/$ - see front matter © 2006 Published by Elsevier Ltd.


The following are examples, taken from biomedical documents, of sentences that describe gene–gene relations:

1. "In vitro experiments demonstrated that MMP-9 was directly inhibited by NAC but was not influenced by TPA." (Anticancer Research, 21(1A), 213–219)

2. "At the same time, PMA induced hyperphosphorylation of MARCKS and talin." (International Journal of Cancer, 75(5), 774–779)

3. "Complex formation with the MDM2 oncogene product is one mechanism inactivating the p53 protein." (European Urology, 32(4), 487–493)

4. "Balance between activated-STAT and MAP kinase regulates the growth of human bladder cell lines after treatment with epidermal growth factor." (International Journal of Oncology, 15(4), 661–667)

It can be seen from the above examples that the syntactic relationships between words can be positive or negative. A positive syntactic relationship (e.g. induce, inhibit, inactivate, regulate) characterizes the G–G relation in a sentence, while a negative one (e.g. not, but, and, nor) signals no relation or even a reversed relation. A syntactic relationship must be positive in order to determine what sort of G–G relation exists. Moreover, an active (or passive) description also expresses an ordered sequence. These sequences represent true biological relations between gene products. In this study, we use a sequential-pattern-mining algorithm to identify interaction patterns between genes. Specifically, we propose a sequential mining-based hybrid model to mine meaningful information-extraction rules that delineate the kinds of morphological features that can appear before and after the gene names in sentences describing gene–gene interactions in documents. This interaction identification traditionally demands heavy resources and often includes extensive cross-referencing.
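As a rough illustration of the positive/negative distinction, the sketch below inspects the words between two names. It is a simplification and not the DiGG rule set: the stem list, the cue list (both taken from the examples in the text), and the function name `relation_cue` are assumptions, and the direction of the relation (active or passive voice) is ignored.

```python
# Stems of the positive relation words given as examples in the text.
POSITIVE_STEMS = ("induc", "inhibit", "inactivat", "regulat")
# Negative cues given as examples in the text.
NEGATIVE_CUES = {"not", "but", "and", "nor"}

def relation_cue(tokens, name_a, name_b):
    """Return the positive stem found between two names, or None.

    None is returned when a negative cue appears between the names
    or when no positive relation word is present.
    """
    i, j = sorted((tokens.index(name_a), tokens.index(name_b)))
    between = [t.lower() for t in tokens[i + 1:j]]
    if any(t in NEGATIVE_CUES for t in between):
        return None
    for token in between:
        for stem in POSITIVE_STEMS:
            if token.startswith(stem):
                return stem
    return None

tokens = "MMP-9 was directly inhibited by NAC but was not influenced by TPA".split()
print(relation_cue(tokens, "MMP-9", "NAC"))  # inhibit
print(relation_cue(tokens, "MMP-9", "TPA"))  # None
```

A filter this crude also rejects genuine relations that span a conjunction, such as PMA and talin in the second example sentence, which is one reason to learn patterns from data instead of fixing them by hand.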

2. Methods

2.1. System architecture

Scientific literature carries much information. To make that information easily and efficiently accessible to researchers, the literature must be computer-readable and causality-interpretable. One way to achieve this is to first divide each document into its constituent sentences and then use a shallow parser to identify the part of speech (POS) of each word in each sentence. The parsing results can then be used as training samples for the subsequent sequential pattern-mining algorithm. The mining stage finds candidate frequent sequences within those sentences. The mining algorithm is especially efficient when the sequential gene/relation patterns in the database are complicated. Fig. 1 shows a schematic flow diagram of the proposed method, which consists of three components in the DiGG system: the preprocessing stage, the mining stage, and the interpretation stage. The approach can be summarized as follows:

1. tagging the parts of speech and gene names/relations,

2. extracting gene–gene interaction rules,

3. mining all positive syntactic patterns from the training samples,

4. associating candidate sequences,

5. displaying the evidence of possible gene–gene relationships graphically.

In this section, we discuss the detailed procedures for the proposed framework and briefly describe the developed system, DiGG (Fig. 2).

2.2. Mining information extraction rules

In this study, we are interested in the biologically sequential relations between genes, not in the words used to describe those relations. We therefore need to divide sentences into several blocks based on stopwords, gene names, and relational terms. A valid sentence is transformed into time-sequential data from left to right. Fig. 3 illustrates how a preprocessed training sample is divided into several blocks. Consider the following training-sample sentence:

"IL-6/Gene was/vbd found/vbn to/to decrease/Rlt mdr1/Gene".
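One way to obtain a tagged sample like the one above is to start from POS-tagged tokens and override the tag of any word found in a gene lexicon or a relation dictionary. The sketch below assumes tiny hypothetical dictionaries (`GENE_LEXICON`, `RELATION_DICT`) and assumes initial POS tags for the gene names and the verb; none of these come from the paper.

```python
# Hypothetical, minimal dictionaries; the real lexicons are far larger.
GENE_LEXICON = {"il-6", "mdr1"}
RELATION_DICT = {"decrease"}

def relabel(pos_tagged):
    """Replace the POS tag with 'Gene' or 'Rlt' for dictionary words; keep it otherwise."""
    labelled = []
    for word, pos in pos_tagged:
        key = word.lower()
        if key in GENE_LEXICON:
            labelled.append((word, "Gene"))
        elif key in RELATION_DICT:
            labelled.append((word, "Rlt"))
        else:
            labelled.append((word, pos))
    return labelled

def to_sequence(labelled):
    """Drop the words and keep the left-to-right sequence of tags."""
    return [tag for _, tag in labelled]

# The 'nn' and 'vb' tags below are assumed outputs of a POS tagger.
sample = [("IL-6", "nn"), ("was", "vbd"), ("found", "vbn"),
          ("to", "to"), ("decrease", "vb"), ("mdr1", "nn")]
labelled = relabel(sample)
print(" ".join(f"{w}/{t}" for w, t in labelled))
# IL-6/Gene was/vbd found/vbn to/to decrease/Rlt mdr1/Gene
print(to_sequence(labelled))
# ['Gene', 'vbd', 'vbn', 'to', 'Rlt', 'Gene']
```

Keeping only the tag sequence reflects the stated goal of modelling the ordered relation between genes and not the particular words used.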

[Fig. 1. Schematic flow diagram of the proposed method, covering a training phase and a test phase. Labels recoverable from the diagram: Text Corpus; Part-of-Speech Tagger; Gene and Interaction Tagging; Gene and Rlt List; Tagged Text; Format Transformation; Training Samples; Sequential Pattern Mining; LItemset Phase; Large Item-sets; Sequential Phase; All Possible Sequences; 2-Sequences; 2-Patterns; Merging; Pattern Matching; Gene pairs; Annotated Text; Association Graph; Interpretation; G-G network.]


[Fig. 2. Pre-processing procedures. Labels recoverable from the figure: text corpus retrieved from PubMed over the internet; divide the corpus into sentences; identify gene names and relation words; part-of-speech tagging; apply the POS rules; extract the training samples.]

[Fig. 3. Reforming sentences into tokens so that they can be processed by sequential pattern-mining algorithms. Example shown in the figure:

Original sentence: The gpa1 mutant blocked stable association of Ste4p with the plasma membrane, and the ste18 mutant blocked stable association of Ste4p with both plasma membranes and internal membranes.

Tagged sentence: The/DT gpa1/NNP mutant/JJ blocked/VBN stable/JJ association/NN of/IN Ste4p/NNP with/IN the/DT plasma/NN membrane/NN ,/, and/CC the/DT ste18/JJ mutant/JJ blocked/VBN stable/JJ association/NN of/IN Ste4p/NNP with/IN both/DT plasma/NN membranes/NNS and/CC internal/JJ membranes/NNS ./.

First clause: The/DT gpa1/NNP mutant/JJ blocked/VBN stable/JJ association/NN of/IN Ste4p/NNP with/IN the/DT plasma/NN membrane/NN

Second clause: ste18/JJ mutant/JJ blocked/VBN stable/JJ association/NN of/IN Ste4p/NNP with/IN both/DT plasma/NN membranes/NNS and/CC internal/JJ membranes/NNS ./.

Resulting block: Gene JJ Rlt JJ NN IN Gene]

Fig. 4. Training sample (left) and its extracted patterns for describing gene–gene relations (right).


According to our gene lexicon, "IL-6" and "mdr1" are marked as gene names; based on a relation dictionary, "decrease" falls into the "relation (Rlt)" category. Words that belong to neither the gene-name nor the relation-word category retain their original part-of-speech designation, for example, "vbd", "vbn", and "to" above. We then feed these training blocks into sequential pattern-mining algorithms (Agrawal & Srikant, 1995; Yen & Chen, 1996) to obtain G–G relation-extraction rules. Sequential pattern-mining algorithms not only discover large itemsets (groups of items that appear together), but also identify large sequences (ordered lists of sets of items). These ordered lists make up the patterns of gene–gene interactions and describe exactly which gene acts on which gene (or genes). Fig. 4 illustrates an example of extracted relation patterns, and Fig. 5 summarizes our sequential pattern-mining algorithm.
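To make the notion of a frequent ("large") sequence concrete, here is a much-simplified level-wise miner over tag sequences. It is a sketch under stated assumptions and not the algorithm of Fig. 5 or of the cited papers: each sequence element is a single tag, not an itemset, there are no separate itemset and transformation phases, and the names `mine`, `support`, and `min_support` are chosen for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's items occur in the sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, db):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)

def mine(db, min_support, max_len=4):
    """Level-wise search: extend each frequent pattern by one frequent item."""
    items = sorted({x for seq in db for x in seq})
    level = [(x,) for x in items if support((x,), db) >= min_support]
    frequent_items = [p[0] for p in level]
    frequent = {}
    length = 1
    while level and length <= max_len:
        for pattern in level:
            frequent[pattern] = support(pattern, db)
        # Any prefix of a frequent sequence is frequent, so appending is complete.
        candidates = [p + (x,) for p in level for x in frequent_items]
        level = [c for c in candidates if support(c, db) >= min_support]
        length += 1
    return frequent

# Toy training blocks in the tag vocabulary used in the text.
db = [
    ["Gene", "vbd", "vbn", "to", "Rlt", "Gene"],
    ["Gene", "Rlt", "Gene"],
    ["Gene", "vbd", "Rlt", "in", "Gene"],
]
patterns = mine(db, min_support=3)
print(patterns[("Gene", "Rlt", "Gene")])  # 3
```

On this toy data the ordered pattern Gene, Rlt, Gene is supported by all three blocks, while tags such as vbd fall below the threshold and are tolerated as gaps between the items.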


3. Experimental results

We used our DiGG system to search for arsenic-induced bladder-cancer-related genes and their relations (Simeonova et al., 2000; Sanchez-Carbayo & Cordon-Cardo, 2003; Chiang, Yu, & Hsu, 2004). A total of 9870 corresponding abstracts were retrieved from the PubMed database (http://www.pubmedcentral.nih.gov/). The system then automatically identified human gene names in those abstracts to filter out non-relevant documents. We applied DiGG to these documents and found 48 sequential G–G relations in valid sentences. Thirty-nine of these G–G relation descriptions were confirmed as "correct" by medical experts from a collaborating bio-research laboratory. Examples of correct relation pairs include "PMA ⟨up-regulate⟩ VEGF", "LPA ⟨induce⟩ ACTIN", and "p53 ⟨activate⟩ WAF1". Our system achieves precision and recall rates of 81.25% and 73%, respectively, for bladder-cancer-related genes (Table 1).
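The reported precision follows directly from the counts given above, as the short check below shows. The recall cannot be recomputed in the same way because the number of true relations in the corpus is not stated, so the sketch only averages the per-task figures listed in Table 1; the variable names are illustrative.

```python
# Counts reported in the text for the bladder-cancer experiment.
extracted = 48   # sequential G-G relations found by the system
confirmed = 39   # relations judged correct by the experts

precision = 100 * confirmed / extracted
print(f"precision = {precision:.2f}%")  # precision = 81.25%

# (precision %, recall %) per task, as listed in Table 1.
table1 = {
    "Bladder cancer": (81.25, 73),
    "CBL": (76, 75),
    "HSP27": (79, 70),
}
avg_precision = sum(p for p, _ in table1.values()) / len(table1)
avg_recall = sum(r for _, r in table1.values()) / len(table1)
print(f"{avg_precision:.2f} {avg_recall:.2f}")  # 78.75 72.67
```

The average recall comes out at about 72.67%, which Table 1 lists as 72.6.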

According to Staab (2002), the precision rate of manual information-retrieval techniques applied to biomedical documents may be above 80%, but with a recall rate of less than 20%. The reasons for this have already been described above. The recall rate of the DiGG system is reasonable because of our system's "fault tolerance" in extracting interaction rules. A limitation of our system is that it cannot handle the sort of very long and broadly descriptive terms that have been popularized by the biomedical community. This requires more research. Two false-positive examples in which the DiGG system is unable to identify the gene–gene relations are shown below (Fig. 6).

• After incubation with 4-ABP, F-actin decreased and G-actin increased in both cytoplasm and nuclei of PC cells and cytoplasmic F-actin fibers were lost, but only cytoplasmic actin was altered in the BC cells.

• Utilizing ASO directed against the raf-1 gene, a central component of this proposed pathway, we were able to reverse the RR phenotype of human tumor cell lines having elevated HER-2 expression or a mutant form of Ha-ras, two genes upstream of raf-1 in signal transduction.

Fig. 6. Graphic interface of the DiGG system. View #1 lists relations extracted from biomedical documents. When users click on one of those relations, they can read its original description in View #2. The G–G network in View #3 displays valid relations between gene products. When users move the cursor over a gene node in View #3, the system shows related information concerning its cellular component, molecular function, and biological process.

Table 1
Test performance obtained from the DiGG system for bladder cancer, CBL, and HSP27

                  Precision (%)   Recall (%)
Bladder cancer    81.25           73
CBL               76              75
HSP27             79              70
Average           78.75           72.6

4. Conclusion

We have developed a system for extracting and visualizing gene relations from biomedical documents. We believe that it can (1) [...] of medical literature data for further analysis, (2) help researchers to understand the inner workings of biological mechanisms through gene–gene relations, and (3) offer insights into relationships between genes or proteins via visualization of gene networks. To make our system more useful for biomedical research and discovery, it is imperative to allow researchers to observe their experimental data, such as gene expression, through gene–gene relations. Therefore, as part of the enhancements to the system, we can correlate protein–protein interactions with gene expression data to identify unknown biological processes.

Acknowledgement

This research work was supported in part by Research Grant NSC91-2321-B-006-003 from the National Science Council of the Republic of China.

References

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of international conference of data engineering, Taiwan, March.

Chiang, J.-H., Yu, H.-C., & Hsu, H.-J. (2004). GIS: a biomedical text-mining system for gene information discovery. Bioinformatics, 20(1), 120–121.

Marcotte, E. M., Xenarios, I., & Eisenberg, D. (2001). Mining literature for protein–protein interactions. Bioinformatics, 17, 359–363.

Ono, T. (2001). Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics, 17, 155–161.

Sanchez-Carbayo, M., & Cordon-Cardo, C. (2003). Applications of array technology: identification of molecular targets in bladder cancer. British Journal of Cancer, 89, 2172–2177.

Simeonova, P. et al. (2000). Arsenic mediates cell proliferation and gene expression in the bladder epithelium: association with activating protein-1 transactivation. Cancer Research, 60, 3445–3453.

Staab, S. (2002). Mining information for functional genomics. IEEE Intelligent Systems, 17, 70–73.

Yen, S. J., & Chen, A. L. P. (1996). An efficient approach to discovering knowledge from large databases. In 4th International conference on parallel and distributed information systems (PDIS '96) (pp. 8–18), December 18.