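National Science Council, Executive Yuan: Research Project Interim Progress Report
Development of open-source bioinformatics software to analyze and explore the mechanism by which human myeloid leukemia cells differentiate into macrophages (2/3)
Project type: Individual project
Project number: NSC93-3112-B-002-042-
Project period: May 1, 2004 to April 30, 2005
Executing institution: Department of Life Science, National Taiwan University
Principal investigator: 阮雪芬 (Hsueh-Fen Juan)
Co-principal investigator: 黃宣誠 (Hsuan-Cheng Huang)
Project participants: 李嘉哲, 黃翠琴, 鄭昆杰, 鮑岳洋, 張智欽, 簡觀喬, 張心儀, 歐承翰
Report type: Full report
Report attachments: Report on attendance at an international conference, and published papers
Handling: This project report is open to public inquiry
April 12, 2005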
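National Science Council, Executive Yuan: Subsidized Research Project
□ Final report
■ Interim progress report
(Project title)
Development of open-source bioinformatics software to analyze and explore the mechanism by which human myeloid leukemia cells differentiate into macrophages (2/3)
Project type: ■ Individual project □ Integrated project
Project number: NSC 93-3112-B-002-042-
Project period: May 1, 2004 to April 30, 2005
Principal investigator: 阮雪芬 (Hsueh-Fen Juan)
Co-principal investigator: 黃宣誠 (Hsuan-Cheng Huang)
Project participants: 李嘉哲, 黃翠琴, 鄭昆杰, 鮑岳洋, 張智欽, 簡觀喬, 張心儀, 歐承翰
Report type (submitted according to the approved funding list): □ Condensed report ■ Full report
This report includes the following required attachments:
□ One report on an overseas business trip or study visit
□ One report on a business trip or study visit to mainland China
■ One report on attendance at an international academic conference and one copy of each published paper
□ One overseas research report for an international cooperative research project
Handling: Except for industry-academia cooperative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases listed below, the report may be opened to public inquiry immediately
□ Involves patents or other intellectual property rights; open to public inquiry after □ one year □ two years
Executing institution: Department of Life Science, National Taiwan University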
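Abstract
In this project we have developed several open-source software tools, including BGSSJ, GeneNetwork, ProteMiner-SSM and ProtExt, all of which can be applied to transcriptomics and proteomics research. These high-throughput methods generate large amounts of experimental data, and annotating the data and exploring gene and protein function require the help of bioinformatics tools. We therefore developed BGSSJ, an XML-based Java application that organizes the genes or proteins of interest, compiles statistics on their functions and draws the result as a tree, which saves a great deal of time. The software is available at http://bgssj.sourceforge.net/.
Networks inferred from high-throughput experimental techniques such as cDNA microarrays help us understand the behavior of living systems. Together with Professor 黃宣誠 (Hsuan-Cheng Huang) of National Yang-Ming University and Research Fellow 陳水田 (Shui-Tein Chen) of Academia Sinica, we have developed a software tool called GeneNetwork, which provides four reverse engineering models and three data interpolation approaches for constructing relationships between genes. The software can be downloaded from http://genenetwork.sbl.bc.sinica.edu.tw/, and the results were published in Bioinformatics in 2004.
ProteMiner-SSM, developed jointly with the laboratory of Professor 歐陽彥正 (Yen-Jen Oyang), is a web server that analyzes the similarity of protein tertiary structures and can further serve as a basis for studying protein-protein and protein-ligand interactions. The software is available at http://proteminer.csie.ntu.edu.tw. Details were published in Nucleic Acids Research in 2004.
We have also developed a text-mining tool, ProtExt, which extracts and reports protein interaction information from the NCBI Entrez-PubMed system. With this software, biologists can easily obtain new pathway-link relationships. A prototype is available at http://protext.csie.org.
In addition, this project was used to develop new proteomics techniques; the results were published in Proteomics in 2004 and 2005.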
In this project, we have developed several open-source software tools, including BGSSJ,
GeneNetwork, ProteMiner-SSM, and ProtExt, for applications in transcriptomics and
proteomics research. Gene-expression profiling and proteomics studies are revolutionizing
biology. These high-throughput methodologies generate experimental data at rates that
exceed knowledge growth. A major challenge for researchers is to make biological sense
out of the large amounts of information proceeding from these experiments. The
interpretation of these experiments can be facilitated by well-presented functional
annotations, which provide an overview of the functions that predominate in clusters as
well as functional annotations for each gene. We have developed BGSSJ, an XML-based
Java application that organizes lists of interesting genes or proteins for biological
interpretation in the context of the Gene Ontology, which organizes information for
molecular function, biological processes and cellular components for a number of different
organisms. The application allows for easy and interactive querying using different gene
identifiers (GenBank ID, UniGene, SwissProt, gene symbol), generates a summary page
with listings of the frequencies of Gene Ontology annotations for each functional category
(cluster), and separate pages with listings of annotations for each gene in a cluster, and
provides quantitative and statistical output files. The visualization browser allows users to
navigate the cluster hierarchy displayed in a tree-like structure and explore the associated
genes or proteins of each cluster through a user-friendly interface. BGSSJ will save time
and enhance the ability to analyze gene expression and proteomics data. BGSSJ is
available at http://bgssj.sourceforge.net/.
Inferring genetic network architecture from time series data generated from
high-throughput experimental technologies, such as cDNA microarray, can help us to
understand the system behavior of living organisms. In collaboration with Professor
Shui-Tein Chen’s lab, we have developed an interactive tool, GeneNetwork, which provides
four reverse engineering models and three data interpolation approaches to infer
relationships between genes. GeneNetwork enables a user to readily reconstruct genetic
networks based on microarray data without having intimate knowledge of the
mathematical models. A simple graphical user interface enables rapid, intuitive mapping
and analysis of the reconstructed network allowing biologists to explore gene relationships
at the system level. Availability: Download from http://genenetwork.sbl.bc.sinica.edu.tw/.
The detailed information has been published in Bioinformatics 20, 2004, 3691–3693.
ProteMiner-SSM, co-developed with Prof. Yen-Jen Oyang’s lab, is a web server for
efficient analysis of similar protein tertiary substructures. Analysis of protein–ligand
interactions is a fundamental issue in drug design. As the detailed and accurate analysis of
protein–ligand interactions involves calculation of binding free energy based on
thermodynamics and even quantum mechanics, which is highly expensive in terms of
computing time, conformational and structural analysis of proteins and ligands has been
widely employed as a screening process in computer-aided drug design. The detailed
information has been published in Nucleic Acids Research, 2004, 32, W76-W82.
We have also developed a text-mining system for protein-protein interaction extraction,
called ProtExt. ProtExt can extract and report protein-protein interactions in the literature
abstracts available at the NCBI Entrez-PubMed system. Our approach is based on the link
grammar and we propose a novel template language (PETL) for extracting protein-protein
interactions embedded in sentences more accurately and customizably. With PETL,
biologists can easily add new templates when seeing a new type of link path. A prototype
web server based on ProtExt system has been implemented and is available at
http://protext.csie.org/.
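For details, please see the reprints of the following articles.
The publications supported by this project, and their full texts, are listed below: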
1. Juan, H.-F.*, Chang, S.-C., Huang, H.-C., Chen, S.-T. (2005) “A New Application of Microwave
Technology to Proteomics” Proteomics 5: 840-842 (first and corresponding author; SCI:
5.766).
2. Wu, C.-C., Huang, H.-C., Juan, H.-F., Chen, S.-T. (2004) “GeneNetwork: an interactive tool
for reconstruction of genetic networks using microarray data” Bioinformatics 20: 3691-3 (SCI:
6.701)
3. Chen, C.-Y., Oyang, Y.-J., Juan, H.-F. (2004) “Incremental generation of summarized
clustering hierarchy for protein family analysis” Bioinformatics 20:2586-96 (SCI: 6.701)
4. Juan, H.-F., Chen, J.-H., Hsu, W.-T., Huang, S.-C., Chen, S.-T., Lin, J. Y. C., Chang, Y.-W.
Chiang, C.-Y., Wen, L.-L., Chan, D.-C., Liu, Y.-C., Chen, Y.-J. (2004) “Identification of
tumor-associated plasma biomarkers using proteomic techniques: from mouse to human”
Proteomics 4: 2766-2775.
5. Juan, H.-F., Liu, H.-L., Hsu, J.-P. (2004) “Recent developments in structural proteomics:
from protein identifications and structure determinations to protein-protein interactions”
Current Proteomics 1: 183-197.
6. Chang, D. T.-H, Chen, C.-Y., Chung, W.-C., Oyang, Y.-J., Juan, H.-F., Huang, H.-C. (2004)
“ProteMiner-SSM: A web server for efficient analysis of similar protein tertiary
substructures” Nucleic Acids Research. 32:W76-82. (SCI: 7.05)
SHORT COMMUNICATION
A new application of microwave technology to proteomics
Hsueh-Fen Juan 1,2, Shing-Chuan Chang 2, Hsuan-Cheng Huang 3 and Shui-Tein Chen 4,5
1 Institute of Molecular and Cellular Biology, National Taiwan University, Taipei
2 National Taipei University of Technology, Taipei
3 Institute of Bioinformatics, National Yang-Ming University, Taipei
4 Institute of Biological Chemistry, Academia Sinica, Taipei
5 Institute of Biochemical Sciences, National Taiwan University, Taipei
Taiwan
Two-dimensional electrophoresis (2-DE) combined with mass spectrometry has significantly improved the possibilities of large-scale identification of proteins. However, 2-DE is limited by its inability to speed up the in-gel digestion process. We have developed a new approach to speed up the protein identification process utilizing microwave technology. Proteins excised from gels are subjected to in-gel digestion with endoprotease trypsin by microwave irradiation, which rapidly produces peptide fragments. The peptide fragments were further analyzed by matrix-assisted laser desorption/ionization technique for protein identification. The efficacy of this technique for protein mapping was demonstrated by the mass spectral analyses of the peptide fragmentation of several proteins, including lysozyme, albumin, conalbumin, and ribonuclease A. The method reduced the required time for in-gel digestion of proteins from 16 hours to as little as five minutes. This new application of microwave technology to protein identification will be an important advancement in biotechnology and proteome research.
Received: May 29, 2004 Revised: September 5, 2004 Accepted: October 1, 2004
Keywords:
In-gel digestion / Microwave
840 Proteomics 2005, 5, 840–842
Proteomics characterizes cellular proteins and their abundance, state of modification, protein complexes and interactions [1]. The global changes in cellular protein expression can be visualized by 1-D or 2-DE and identified by MS [2]. Peptide identification can be accomplished by peptide mass mapping. The first step is the in-gel digestion of proteins by sequence-specific proteases. Since each amino acid residue has a unique mass, protein digestion will yield a set of distinct peptides specific to each protein. A mass spectrum of digested peptides, therefore, results in a unique PMF. The set of peptide masses obtained by MS is then used to search against protein databases created by in silico cleavage of all known sequences [3]. This method has been shown to be particularly successful for the identification of proteins [1, 4–6]. However, it is limited by its inability to speed up the in-gel digestion process.
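The peptide mass mapping idea described above can be illustrated with a minimal Python sketch. It is not the MASCOT algorithm: it only performs an in silico tryptic digest (cleavage after K or R unless followed by P, no missed cleavages), computes monoisotopic [M+H]+ masses from standard residue masses, and counts how many theoretical peptides are matched by observed masses. The function names, the toy sequence and the 0.2 Da tolerance are assumptions made for illustration. Note that leucine and isoleucine share the same residue mass, so they cannot be distinguished by mass alone.

```python
# Monoisotopic residue masses in daltons (standard values; L and I are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # charge carrier for a singly protonated ion


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def mh_mass(peptide):
    """Monoisotopic mass of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def match_fingerprint(sequence, observed, tol=0.2):
    """Return (matched, total): matched theoretical peptides versus all of them."""
    theoretical = [mh_mass(p) for p in tryptic_digest(sequence)]
    matched = sum(any(abs(m - o) <= tol for o in observed) for m in theoretical)
    return matched, len(theoretical)


# Hypothetical toy sequence and peak list, for illustration only.
toy_sequence = "AGKSTRPLKVE"
peaks = [mh_mass("AGK") + 0.05, 500.0]
print(tryptic_digest(toy_sequence), match_fingerprint(toy_sequence, peaks))
```

The returned pair corresponds to a matched/theoretical peptide count; a real search engine additionally scores the match statistically and allows for missed cleavages and modifications.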
Microwave applications in peptide biochemistry [7], immunohistochemistry [8], antigen retrieval [9], protein staining techniques [10], DNA extraction method [11], and enzyme reaction [12] have been reported. However, the application of microwave to in-gel digestion has not been previously described in detail in the literature despite its apparent use in laboratories.
In this report, we describe a new approach utilizing microwave technology to speed up the in-gel digestion reaction. Proteins excised from gels are subjected to in-gel digestion with endoprotease trypsin under microwave irradiation instead of conventional incubation at 37 °C. Furthermore, the digested peptide fragments were analyzed by MALDI-quadrupole(Q)-TOF MS for protein identification using both PMF and MS/MS technique. First, five proteins, including lysozyme, albumin from chicken egg, albumin from bovine, conalbumin, and ribonuclease were separated by 1-D SDS-PAGE and stained by Coomassie blue (Fig. 1). The Coomassie blue-stained protein bands were excised from the gel and digested with trypsin (Promega, Madison, WI, USA). The gel pieces were soaked in 100% acetonitrile
Correspondence: Dr. Hsueh-Fen Juan, Department of Life Science, Institute of Molecular and Cellular Biology, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 106 Taiwan
E-mail: [email protected] Fax: +886-2-23673374
© 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.de
Figure 1. 1-DE separation of five proteins (25 µg per protein) in 12.5% SDS gel. The gel was stained with Coomassie blue. Lane 1, molecular weight markers; lane 2, albumin (chicken egg); lane 3, ribonuclease A; lane 4, conalbumin; lane 5, lysozyme; lane 6, albumin (bovine). Protein bands marked by black arrows were further in-gel digested by the traditional or microwave method and identified by mass spectrometry.
for 5 min, dried in a lyophilizer for 20–30 min and rehydrated in 25 mM ammonium bicarbonate buffer (pH 8.0) containing 35 µL of 10 µg/mL trypsin until the gel pieces were fully immersed. The gel solution with trypsin was either put into a microwave for 5 min at 195 W or 325 W, or incubated for 5 min or 16 h at 37 °C. After the in-gel digestion either with or without microwave irradiation, the peptide fragments were extracted twice with 50 µL of 50% acetonitrile/0.1% TFA. After removal of acetonitrile by centrifugation in a vacuum centrifuge, the peptides were directly spotted on the sample plate of the MALDI-TOF mass spectrometer. Finally, CHCA (0.5 µL of 10 mg/mL) was applied to each spot, and the spots were air-dried at room temperature prior to acquiring mass spectra (M@LDI, Micromass, Manchester, UK). The mass spectra of tryptic peptides of albumin (chicken egg) are shown in Fig. 2A for the traditional method and Fig. 2B for the microwave method.
The monoisotopic peptide mass values were searched against the Swiss-Prot database using the MASCOT PMF search program (http://www.matrixscience.com) [13]. The corresponding matched tryptic fragments versus theoretical tryptic fragments using PMF are shown in Table 1. A larger number of matched tryptic fragments implies higher yields of proteolytic digestion. The method using microwave power of 195 W gives more matched fragments than the traditional method (37 °C, 16 h) for all the proteins except conalbumin; in this case the number of matched fragments is slightly lower but close to the traditional method. Considering the required reaction time for the traditional method is much longer, the microwave method apparently gives higher efficiency for in-gel digestion. Besides using PMF, the most intense ions in the TOF-MS spectrum were selected to perform an optimized MS/MS analysis. The product ion spectra generated by MALDI-TOF-MS/MS were searched against the Swiss-Prot database for exact matches using the MASCOT MS/MS ion search program [13]. All five protein bands were
Table 1. The comparison of the MALDI-TOF-MS and MS/MS data for five standard proteins using traditional and microwave methods

                                         Traditional (37 °C)     Microwave
Power                                                            195 W      325 W
Time                                     16 h       5 min        5 min      5 min
Albumin (chicken egg)   P/T a)           9/33       0/33         11/33      8/33
                        PMF score b)     60         – e)         73         45
                        MS/MS score c)   45         –            60         38
Ribonuclease A          P/T              0/15       0/15         6/15       9/15
                        PMF score        –          –            46         60
                        MS/MS score      70         –            163        83
Lysozyme                P/T              6/18       0/18         10/18      10/18
                        PMF score        52         –            61         63
                        MS/MS score      70         –            173        125
Conalbumin              P/T              20/87      0/87         18/87      19/87
                        PMF score        69         –            57         56
                        MS/MS score      13 d)      –            150        131
Albumin (bovine)        P/T              9/78       0/78         17/78      15/78
                        PMF score        35         –            41         39
                        MS/MS score      90         –            174        29 d)

a) P/T denotes the number of matched peptide fragments versus the theoretical number of total peptide fragments by trypsin in PMF analysis
b) The probability-based MOWSE score returned from the MASCOT peptide mass fingerprint search program
c) The probability-based MOWSE score returned from the MASCOT MS/MS ion search program
d) The produced ion spectra failed to be identified as the correct proteins
e) Not determined
successfully identified as albumin from chicken egg, ribonuclease A, conalbumin, lysozyme, and albumin from bovine, respectively. The corresponding search scores are shown in Table 1. The method using microwave power of 195 W gives the highest scores. In conclusion, we have shown that microwave-assisted reactions can produce high efficiency, purity and accuracy in minutes, while traditional methods would require hours.
We have successfully demonstrated the usefulness of a new method to apply microwave to in-gel digestion and to combine with MALDI-Q-TOF for protein identification. This approach speeds up the in-gel digestion process to minutes versus hours using traditional methods. The use of microwave technology in protein identification will be a very powerful tool in proteome research and may prove successful in drug discovery and development.
This work was supported by National Science Council of Taiwan (NSC 93-3112-B-002-042). We gratefully acknowledge the Core Facilities for Proteomics Research, Academia Sinica, Taiwan and Supachai Topanurak for technical support.
Figure 2. The MALDI-TOF mass spectrum of albumin (chicken egg) is shown. The mass spectrum was obtained after in-gel digestion: (A) using the traditional method, incubated for 16 h at 37 °C; (B) using the microwave method with microwave irradiation of 195 W for 5 min.
References
[1] Gygi, S. P., Aebersold, R., Curr. Opin. Chem. Biol. 2000, 4, 489–494.
[2] Sinchaikul, S., Sookkheo, B., Topanuruk, S., Juan, H. F. et al., J. Chromatogr. B 2002, 771, 261–287.
[3] Fenyo, D., Curr. Opin. Biotechnol. 2000, 11, 391–395.
[4] Yates, J. R. III, Trends Genet. 2000, 16, 5–8.
[5] Patterson, S. D., Aebersold, R., Electrophoresis 1995, 16, 1791–1814.
[6] Juan, H. F., Lin, J. Y. C., Chang, W. H., Wu, C. Y. et al., Electrophoresis 2002, 23, 2490–2504.
[7] Chen, S. T., Chiou, S. H., Wang, K. T., J. Chin. Chem. Soc. 1991, 38, 85–91.
[8] Schad, A., Fahimi, H. D., Volkl, A., Baumgart, E., J. Histochem. Cytochem. 2003, 51, 751–760.
[9] Redkar, A. A., Krishan, A., Cytometry 1999, 38, 61–69.
[10] Nesatyy, V. J., Dacanay, A., Kelly, J. F., Ross, N. W., Rapid Commun. Mass Spectrom. 2002, 16, 272–280.
[11] Sato, Y., Sugie, R., Tsuchiya, B., Kameya, T. et al., Diagn. Mol. Pathol. 2001, 10, 265–271.
[12] Pramanik, B. N., Mirza, U. A., Ing, Y. H., Liu, Y. H. et al., Protein Sci. 2002, 11, 2676–2687.
[13] Hirosawa, M., Hoshida, M., Ishikawa, M., Toya, T., Comput. Appl. Biosci. 1993, 9, 161–167.
GeneNetwork: an interactive tool for reconstruction of genetic networks using microarray data
Chia-Chin Wu 1, Hsuan-Cheng Huang 1,2,∗, Hsueh-Fen Juan 3,4 and Shui-Tein Chen 1,3,5,∗
1 Institute of Biological Chemistry and Genomics Research Center, Academia Sinica, Taipei, Taiwan
2 Institute of Bioinformatics, National Yang-Ming University, Taipei, Taiwan
3 Department of Life Science, Institute of Molecular and Cellular Biology, Institute of Biochemical Sciences, National Taiwan University, Taipei, Taiwan
4 Department of Chemical Engineering, National Taipei University of Technology, Taipei, Taiwan
5 ALPS Biotech Co., Ltd, Taipei, Taiwan
Received on November 12, 2003; revised on April 29, 2004; accepted on July 5, 2004 Advance Access publication July 22, 2004
ABSTRACT
Summary: Inferring genetic network architecture from time series data generated from high-throughput experimental technologies, such as cDNA microarray, can help us to understand the system behavior of living organisms. We have developed an interactive tool, GeneNetwork, which provides four reverse engineering models and three data interpolation approaches to infer relationships between genes. GeneNetwork enables a user to readily reconstruct genetic networks based on microarray data without having intimate knowledge of the mathematical models. A simple graphical user interface enables rapid, intuitive mapping and analysis of the reconstructed network allowing biologists to explore gene relationships at the system level.
Availability: Download from http://genenetwork.sbl.bc.sinica.edu.tw/
Contact: [email protected]; [email protected]. edu.tw
Supplementary information: Supplement documentation of algorithms for the four approaches is downloadable at the above location.
INTRODUCTION
Most biochemical relationships among genes, proteins and other organic substrates are known to be many-to-many, meaning that one component can have many functions and one function can be influenced by many components. To understand these complex relationships, the structure of a biological system, such as regulatory relationships of genes, needs to be identified first. Reverse engineering methods provide a good way to model genetic interactions as network diagrams of interacting elements based on time-course gene-expression data generated from cDNA microarray experiments. The reconstructed genetic network can then be validated experimentally.
∗To whom correspondence should be addressed.
Because most genetic network models are mathematically and computationally complicated, a full understanding of the logic and complex behavior of genetic networks will require the development of tools for the computational and visual exploration of complex networks. Although several previous attempts have been made to visualize pathways from prior known knowledge and to simulate system dynamic processes in software packages (Breitkreutz et al., 2003; Dahlquist et al., 2002; Shannon et al., 2003), none of them allow users to infer genetic networks from experimental gene-expression data using reverse engineering approaches. This paper presents a computational and user-friendly software tool, GeneNetwork, to visually reconstruct genetic networks from gene-expression data using reverse engineering models. It can be used by biologists with only a minimal amount of mathematical training, yet gives them the power to explore a wide range of sophisticated questions about genetic networks.
OVERVIEW OF THE SOFTWARE
The architecture of GeneNetwork, written in C++, is outlined in Figure 1. The work flow for GeneNetwork is as follows: (1) input experimental data in tab-delimited text format; (2) interpolate data through the Interpolation Controller if the number or sets of experimental data points are insufficient to initiate the inference calculations; (3) implement reverse engineering inference approaches through the Modeling Controller to generate the gene regulation matrix that describes
[Figure 1 (block diagram): Data Input; Data Interpolation (1. linear interpolation, 2. Lagrange polynomial interpolation, 3. cubic spline interpolation); Reverse Engineering Models (1. Boolean network, 2. linear model, 3. S-system, 4. Bayesian network), with a genetic algorithm searching the solution space; Gene Regulation Matrix; Network Visualization (1. random layout, 2. circular layout, 3. layer layout); on-line database validation. The user works through an interactive interface comprising the Interpolation Controller, Modeling Controller, Information Viewer and Network Graph Viewer.]
Fig. 1. The architecture of GeneNetwork.
how genes regulate each other; (4) automatically draw the network for visualization, based on the regulation matrix; (5) compare the inferred intuitive network with on-line databases such as KEGG (Kanehisa et al., 2004), based on the information from the Network Graph Viewer and the Information Viewer; and (6) review the proposed sets of experiments and generate hypotheses. These high-level capabilities of GeneNetwork are described as follows.
Interpolation Controller
The required minimum number of data time points depends on the number of variables in the mathematical model for genetic network inference. If the time points of experimental data are insufficient to fulfill the requirement of the specified model, the network analysis can be initiated by interpolation of the time series data points. The Interpolation Controller provides three selections of data interpolation approaches: linear, Lagrange polynomial and cubic spline interpolation (Constantinides and Mostoufi, 1999).
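The three interpolation approaches named above can be sketched in plain Python as follows. This is an illustrative sketch only, not GeneNetwork's C++ implementation; the function names are hypothetical, and the spline uses natural boundary conditions (zero second derivative at both ends), which is an assumption because the text does not state which boundary conditions the tool applies.

```python
import bisect


def linear_interp(ts, ys, t):
    """Piecewise-linear interpolation at time t (ts must be increasing)."""
    i = min(max(bisect.bisect_right(ts, t) - 1, 0), len(ts) - 2)
    w = (t - ts[i]) / (ts[i + 1] - ts[i])
    return ys[i] + w * (ys[i + 1] - ys[i])


def lagrange_interp(ts, ys, t):
    """Evaluate the single polynomial passing through all points at time t."""
    total = 0.0
    for i, (ti, yi) in enumerate(zip(ts, ys)):
        term = yi
        for j, tj in enumerate(ts):
            if j != i:
                term *= (t - tj) / (ti - tj)
        total += term
    return total


def natural_cubic_spline(ts, ys):
    """Return a function that evaluates the natural cubic spline through the points."""
    n = len(ts)
    h = [ts[i + 1] - ts[i] for i in range(n - 1)]
    # Tridiagonal system for the second derivatives m, with m[0] = m[-1] = 0.
    a = [0.0] * n
    b = [1.0] * n
    c = [0.0] * n
    d = [0.0] * n
    for i in range(1, n - 1):
        a[i], b[i], c[i] = h[i - 1], 2 * (h[i - 1] + h[i]), h[i]
        d[i] = 6 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    for i in range(1, n):              # forward elimination (Thomas algorithm)
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    m = [0.0] * n
    m[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        m[i] = (d[i] - c[i] * m[i + 1]) / b[i]

    def evaluate(t):
        i = min(max(bisect.bisect_right(ts, t) - 1, 0), n - 2)
        A = (ts[i + 1] - t) / h[i]
        B = (t - ts[i]) / h[i]
        return (A * ys[i] + B * ys[i + 1]
                + ((A ** 3 - A) * m[i] + (B ** 3 - B) * m[i + 1]) * h[i] ** 2 / 6)

    return evaluate


# Hypothetical expression profile sampled at four time points.
times = [0.0, 1.0, 2.0, 3.0]
levels = [0.0, 1.0, 4.0, 9.0]
spline = natural_cubic_spline(times, levels)
print(linear_interp(times, levels, 1.5), lagrange_interp(times, levels, 1.5), spline(1.5))
```

Linear interpolation is the most conservative choice; a single Lagrange polynomial can oscillate strongly when there are many time points, which is one common reason to prefer piecewise cubic splines for longer series.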
Modeling Controller
Various reverse engineering algorithms have been used to model genetic regulatory networks (de Jong, 2002). GeneNetwork offers four different inference models to extract the ‘gene regulation matrix’ from the gene expression data: (1) the linear model (D’haeseleer et al., 1999) is a continuous method that uses linear ordinary differential equations to describe the system; (2) the S-system (Kikuchi et al., 2003) is an approximation of traditional rate laws with a uniform type of non-linear ordinary differential equation in which the component processes are characterized by the power-law functions; (3) the Boolean network (Liang et al., 1998) is a logical description in which variables and functions (the relationships between the components) are simply presented as ON or OFF; and (4) the dynamic Bayesian network (de Jong, 2002) stochastically models causality between genes over time series data. For the latter three models, the genetic algorithm is applied to effectively search for the optimal point in the large solution space and to learn network structure (Repsilber et al., 2002). Users can change the parameters in the four approaches through the Modeling Controller.
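As an illustration of how a gene regulation matrix can be extracted with the linear model, the following Python sketch approximates dx/dt = W x by finite differences and estimates each row of W by ordinary least squares. This is a simplified stand-in under stated assumptions, not the algorithm implemented in GeneNetwork: the function names, the finite-difference approximation and the plain normal-equation solver are all choices made here for illustration. It also shows why the number of time points matters: the normal equations are only solvable when the series contains at least as many linearly independent expression vectors as genes.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear_model(series, dt):
    """Estimate W in dx/dt = W x; series[t][j] is the level of gene j at time point t."""
    n = len(series[0])
    X = series[:-1]
    # Forward finite differences approximate the time derivatives.
    dX = [[(b - a) / dt for a, b in zip(x0, x1)]
          for x0, x1 in zip(series, series[1:])]
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    W = []
    for g in range(n):
        Xty = [sum(x[i] * d[g] for x, d in zip(X, dX)) for i in range(n)]
        W.append(solve(XtX, Xty))   # row g: influence of every gene on gene g
    return W


# Hypothetical two-gene system simulated with Euler steps, then re-inferred.
W_true = [[-0.5, 0.0], [1.0, -0.2]]
dt = 0.1
series = [[1.0, 0.5]]
for _ in range(5):
    x = series[-1]
    series.append([x[i] + dt * sum(W_true[i][j] * x[j] for j in range(2))
                   for i in range(2)])
print(fit_linear_model(series, dt))
```

With noisy microarray data the fit would normally need regularization or many more time points than genes; the noise-free toy data here only demonstrates the mechanics.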
Network Graph Viewer/Information Viewer
To extract valuable information from the gene regulation matrix, GeneNetwork embraces several network visualization layouts. A network diagram is presented with nodes corresponding to genes and edges indicating relations between the genetic network components. Information on the network structure and genes, from the gene regulation matrix and input information, can be shown on the Information Viewer. Clicking on any node reveals the biological processes that involve the selected gene and its relation to others. GeneNetwork is fully customizable and allows users to define personal settings to generate interaction networks by manipulating several graphical setting options, such as linkage changes, gene selections, gene searches, font and graph settings, etc.
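How a gene regulation matrix might be turned into a drawable network can be sketched as follows. The conventions used here are assumptions for illustration, not taken from GeneNetwork: entry matrix[i][j] is read as the influence of gene j on gene i, positive weights are labeled activation and negative weights repression, and the function names are hypothetical. The layout shown is a simple circular one.

```python
import math


def circular_layout(genes, radius=1.0):
    """Place genes evenly on a circle; returns {gene: (x, y)}."""
    n = len(genes)
    return {g: (radius * math.cos(2 * math.pi * k / n),
                radius * math.sin(2 * math.pi * k / n))
            for k, g in enumerate(genes)}


def edges_from_matrix(genes, matrix, threshold=0.0):
    """List (regulator, target, kind) for entries whose magnitude exceeds threshold.

    matrix[i][j] is read as the influence of gene j on gene i (assumed convention).
    """
    edges = []
    for i, row in enumerate(matrix):
        for j, w in enumerate(row):
            if abs(w) > threshold:
                kind = "activation" if w > 0 else "repression"
                edges.append((genes[j], genes[i], kind))
    return edges


# Hypothetical three-gene regulation matrix.
genes = ["A", "B", "C"]
matrix = [[0.0, 0.0, 0.0],
          [0.8, 0.0, 0.0],
          [0.0, -0.5, 0.0]]
print(circular_layout(genes))
print(edges_from_matrix(genes, matrix))
```

Raising the threshold is one simple way to thin out weak links before drawing, comparable in spirit to the linkage changes mentioned above.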
DISCUSSION
The four inference models in GeneNetwork have different advantages and weaknesses, and users can select the appropriate model based on their requirements. The linear model is a gross simplification for most biological systems, and its assumptions may be unrealistic, but it offers an easy method to infer a genetic network. The S-system can capture non-linear system dynamics, although the method incurs a large computational cost to search for the optimal solution. In the Boolean network model, the regulatory control of gene expression is expressed by logical rules, which allows large-scale genetic networks to be analyzed in an efficient way. The advantages of the dynamic Bayesian network include the ability to model stochasticity, to incorporate prior knowledge, and to handle hidden variables and missing data in a principled way. Nevertheless, determining the optimal network structure of Bayesian networks is an NP-hard problem. Furthermore, discretization of gene expression in both Boolean and Bayesian models would induce information loss.
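The information loss caused by discretization can be made concrete with a small sketch. The thresholding rule (ON when above the profile mean) and the two expression profiles are hypothetical choices for illustration; the text does not specify how GeneNetwork discretizes data.

```python
def binarize(profile, threshold=None):
    """Map expression levels to ON (1) or OFF (0); the default threshold is the mean."""
    if threshold is None:
        threshold = sum(profile) / len(profile)
    return [1 if v > threshold else 0 for v in profile]


# Two hypothetical genes with very different induction strengths.
strong = [0.1, 0.2, 5.0, 9.0]
weak = [1.0, 1.1, 1.6, 1.7]

# Both profiles collapse to the same ON/OFF pattern, so a Boolean model
# can no longer tell a ninety-fold induction from a modest one.
print(binarize(strong), binarize(weak))
```

Continuous models such as the linear model or the S-system keep this magnitude information, at the price of more parameters to estimate.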
In the Supplementary material, we provide detailed descriptions of the four methods and an application of the Saccharomyces cerevisiae cell-cycle gene-expression data (Spellman et al., 1998) to GeneNetwork. Many of the inferred gene relations are known to be involved in the S. cerevisiae cell-cycle pathway.
FUTURE WORKS
Future work will focus on automatic integration with on-line databases to provide more up-to-date genome information to a user while using GeneNetwork. In addition, the visualization capabilities for large-scale network layout will be enhanced.
ACKNOWLEDGEMENT
We thank John Y. Lin for revising the manuscript. We gratefully acknowledge the support of the National Research Program for Genomic Medicine of National Science Council, Taiwan (NSC 91-3112-13-001-002 and NSC 92-3112-B-027-001).
REFERENCES
Breitkreutz,B.J., Stark,C. and Tyers,M. (2003) Osprey: a network visualization system. Genome Biol., 4, R22.
Constantinides,A. and Mostoufi,N. (1999) Numerical Methods for Chemical Engineers with Matlab Applications. Prentice-Hall Inc., NJ.
Dahlquist,K.D., Salomonis,N., Vranizan,K., Lawlor,S.C. and Conklin,B.R. (2002) GenMAPP: a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet., 31, 19–20.
D’haeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1999) Linear modeling of mRNA expression levels during CNS development and injury. Pac. Symp. Biocomput., 4, 41–52.
de Jong,H. (2002) Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol., 9, 67–103.
Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, 277–280.
Kikuchi,S., Tominaga,D., Masanori,A. and Tomita,M. (2003) Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics, 19, 643–650.
Liang,S., Fuhrman,S. and Somogyi,R. (1998) REVEAL: a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput., 3, 18–29.
Repsilber,D., Liljenstrom,H. and Andersson,S.G. (2002) Reverse engineering of regulatory networks: simulation studies on a genetic algorithm approach for ranking hypotheses. Biosystems, 66, 31–41.
Shannon,P., Markiel,A., Ozier,O., Baliga,N.S., Wang,J.T., Ramage,D., Amin,N., Schwikowski,B. and Ideker,T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498–2504.
Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273.
BIOINFORMATICS Vol. 20 no. 16 2004, pages 2586–2596
doi:10.1093/bioinformatics/bth290
Incremental generation of summarized clustering hierarchy for protein family analysis
Chien-Yu Chen 1,∗, Yen-Jen Oyang 1 and Hsueh-Fen Juan 2
1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan, R.O.C.
2 Institute of Biotechnology and Department of Chemical Engineering, National Taipei University of Technology, Taipei 106, Taiwan, R.O.C.
Received on November 18, 2003; revised on March 28, 2004; accepted on April 25, 2004
Advance Access publication May 6, 2004
ABSTRACT
Motivation: Protein sequence clustering has been widely exploited to facilitate in-depth analysis of protein functions and families. For some applications of protein sequence clustering, it is highly desirable that a hierarchical structure, also referred to as dendrogram, which shows how proteins are clustered at various levels, is generated. However, as the sizes of contemporary protein databases continue to grow at rapid rates, it is of great interest to develop some summarization mechanisms so that the users can browse the dendrogram and/or search for the desired information more effectively.
Results: In this paper, the design of a novel incremental clustering algorithm aimed at generating summarized dendrograms for analysis of protein databases is described. The proposed incremental clustering algorithm employs a statistics-based model to summarize the distributions of the similarity scores among the proteins in the database and to control formation of clusters. Experimental results reveal that, due to the summarization mechanism incorporated, the proposed incremental clustering algorithm offers the users highly concise dendrograms for analysis of protein clusters with biological significance. Another distinction of the proposed algorithm is its incremental nature. As the sizes of the contemporary protein databases continue to grow at fast rates, due to the concern of efficiency, it is desirable that cluster analysis of a protein database can be carried out incrementally, when the protein database is updated. Experimental results with the Swiss-Prot protein database reveal that the time complexity for carrying out incremental clustering with $k$ new proteins added into the database containing $n$ proteins is $O(n^{2\beta} \log n)$, where $\beta \cong 0.865$, provided that $k \ll n$.
Availability: The Linux executable is available on the following supplementary page.
Contact: Graduate School of Biotechnology and Bioinformatics, Yuan–Ze University, Chang–Li 320, Taiwan, ROC. Email: [email protected]
Supplementary information: http://mars.csie.ntu.edu.tw/~cychen/protein_clustering/psc.htm
∗To whom correspondence should be addressed.
1 INTRODUCTION
Protein sequence clustering is a process that aims to identify sets of homologous proteins in a protein database (Lesk, 2002; Kriventseva et al., 2001a). The information derived from protein sequence clustering is then widely used for further analysis such as protein family discovery, function prediction and database compression (Abascal and Valencia, 2002; Apweiler et al., 2001; Enright et al., 2002; Li et al., 2001, 2002; Kriventseva et al., 2001a; Sasson et al., 2003; Yona et al., 1999). The general practice to carry out protein sequence clustering is based on pairwise sequence similarity/dissimilarity between two proteins computed by algorithms such as Smith–Waterman (Smith and Waterman, 1981), BLAST (Altschul et al., 1990, 1997) and FASTA (Pearson and Lipman, 1998). The widely adopted hypothesis is that a high degree of sequence similarity between two proteins implies that these two proteins also have similar structures and/or functions (Dayhoff, 1976; Hegyi and Gerstein, 1999; Lesk, 2002).
In recent studies of protein sequence clustering, the single-link clustering algorithm (Koonin et al., 1995; Kriventseva et al., 2001b; Watanabe and Otsuka, 1995) and alternative graph-based clustering algorithms (Bolten et al., 2001; Enright et al., 2002; Kawaji et al., 2001; Matsuda et al., 1996) are most commonly employed, owing to the clustering quality that these two types of algorithms deliver. In protein sequence clustering, a popular measure of clustering quality is based on how well the clusters identified by the clustering algorithm match the protein families defined in some databases (Kawaji et al., 2001; Yona et al., 1999).
For some applications of protein sequence clustering, it is highly desirable to generate a hierarchical structure, also referred to as a dendrogram, which shows how proteins are clustered at various levels (Lesk, 2002; Sasson et al., 2002; Kriventseva et al., 2001b). However, as the sizes of contemporary protein databases continue to grow at rapid rates, the number of nodes and the depth of the dendrogram generated by the clustering algorithm have become too large for users to browse the dendrogram effectively and/or to search for the desired information. As a result, the user may need to impose some thresholds to flatten the dendrogram so that its visualization and interpretative quality are improved. Nevertheless, figuring out the proper threshold values may be a tedious and difficult task, especially for naive users. Therefore, it is of great interest to develop a summarization mechanism for presenting the results of protein sequence clustering.
In this paper, the design of a novel incremental clustering algorithm aimed at generating summarized dendrograms for analysis of protein databases is described. The proposed incremental clustering algorithm employs a statistics-based model to summarize the distributions of the similarity scores among the proteins in the database and to control the formation of clusters. Owing to the summarization operations employed, the dendrogram generated by the proposed algorithm offers better visual and interpretative quality than the dendrogram generated by the conventional single-link algorithm. In this paper, the weighted average matching rate is employed to measure how well a clustering algorithm can cluster proteins in conformity with human interpretation, and no cutoff threshold is imposed to flatten the dendrogram generated by the single-link algorithm. The dendrogram generated by the proposed incremental clustering algorithm contains far fewer non-leaf nodes than that generated by the single-link algorithm. Furthermore, those clusters identified by the single-link algorithm that best match the protein families defined in InterPro (Apweiler et al., 2000) are deeply embedded in the dendrogram. On the other hand, in the dendrogram generated by the proposed incremental clustering algorithm, most of the clusters that best match the protein families defined in InterPro are located just one level down from the root. Owing to the summarization mechanism incorporated, the proposed incremental clustering algorithm offers users highly concise dendrograms for analysis of protein clusters with biological significance.
Another main property of the proposed incremental clustering algorithm is that, when the protein database is updated, there is no need to redo all the clustering analysis from scratch. Instead, the incremental clustering algorithm can refer to an abstraction generated by the previous run of the algorithm and carry out the analysis much more efficiently. This issue is of significance, as contemporary protein databases, such as Swiss-Prot (Bairoch and Apweiler, 2000) and PIR (Wu et al., 2002), keep growing rapidly. Experimental results with the Swiss-Prot protein database reveal that the time complexity of incremental clustering with k new proteins added into a database containing n proteins is $O(n^{2\beta} \log n)$, where $\beta \approx 0.865$, provided that $k \ll n$.
Fig. 1. A system diagram that summarizes the operations carried out by the proposed incremental clustering algorithm. (Diagram labels: Protein database, Some samples, Initialization phase, Initial dendrogram, Summarization process, Abstract dendrogram, New proteins, Incremental phase, New abstract dendrogram.)
The remaining part of this paper is organized as follows. Section 2 elaborates the design of the proposed incremental clustering algorithm. Section 3 reports the experiments conducted to evaluate the performance of the proposed algorithm. Section 4 concludes the discussion of this paper.
2 METHODS AND ALGORITHMS
2.1 Overview of the incremental clustering algorithm
Figure 1 shows the major operations carried out by the incremental clustering algorithm presented in this paper. The algorithm consists of two phases of operations, namely, the initialization phase and the incremental phase. In the initialization phase, a set of proteins extracted from the protein database is taken to construct the initial dendrogram. The number of proteins extracted could range from a few hundred to a few thousand. With the initial set of proteins, a conventional agglomerative hierarchical clustering algorithm, such as single-link or complete-link (Han and Kamber, 2000; Jain and Dubes, 1988), is invoked to construct the initial dendrogram. In this paper, the single-link algorithm is employed for carrying out the initial protein clustering.
With the initial dendrogram, a summarization process is then conducted to identify a set of representatives that collectively provide an abstract description of the distribution of the samples. With these representatives, an abstract dendrogram is generated. The abstract dendrogram provides the basis on which the incremental phase of the clustering algorithm proceeds. In the incremental phase, all the remaining proteins in the database, i.e. those proteins that were not taken as samples, as well as the new proteins that are continuously added into the database, are examined one by one, and the abstract dendrogram is updated dynamically to reflect the evolution of the protein database.
C.-Y.Chen et al.
In the following three subsections, how the representatives are identified and how clustering is carried out incrementally are elaborated.
2.2 The summarization process
In the summarization process, protein clusters that meet the following statistical criteria are identified as homogeneous clusters. The statistical criteria are imposed to guarantee that the pairwise similarity scores among the protein sequences in a homogeneous cluster have a symmetrical unimodal distribution (Jobson, 1991). In the following discussion, we treat the pairwise similarity scores among the protein sequences in a cluster denoted by C as $|C| \times (|C| - 1)/2$ random samples of a random variable $X_C$. Let $S_C$ denote the set of $|C| \times (|C| - 1)/2$ random samples. In statistics, the corresponding skewness and kurtosis, denoted by Skew(C) and Kurt(C), respectively, in this paper, are defined as follows (Jobson, 1991):

$$ \mathrm{Skew}(C) = \frac{|S_C|}{(|S_C| - 1)(|S_C| - 2)} \sum_{x_i \in S_C} \frac{(x_i - \bar{x})^3}{s^3}, $$

$$ \mathrm{Kurt}(C) = \frac{|S_C|\,(|S_C| + 1)}{(|S_C| - 1)(|S_C| - 2)(|S_C| - 3)} \sum_{x_i \in S_C} \frac{(x_i - \bar{x})^4}{s^4} - \frac{3\,(|S_C| - 1)^2}{(|S_C| - 2)(|S_C| - 3)}, $$

where

$$ \bar{x} = \frac{1}{|S_C|} \sum_{x_i \in S_C} x_i \quad \text{and} \quad s = \sqrt{\frac{1}{|S_C| - 1} \sum_{x_i \in S_C} (x_i - \bar{x})^2}. $$
With skewness and kurtosis, thresholds are set to guarantee that each homogeneous cluster has a symmetric unimodal distribution of the pairwise similarities. By setting the lower bound of kurtosis, we can guarantee that a group of protein sequences with a multimodal distribution will be decomposed into a number of homogeneous clusters. By setting both the upper and lower bounds of skewness, we can guarantee that the distribution of the pairwise similarities among the proteins in a homogeneous cluster is symmetric. An asymmetric distribution implies that a small subgroup of proteins in the cluster has pairwise similarities among its members that are either much higher or much lower than the pairwise similarities among the remaining proteins in the cluster. Accordingly, we define the homogeneous cluster as follows.
Definition 1 (Homogeneous cluster). A protein cluster C is said to be homogeneous if the corresponding skewness and kurtosis satisfy the following criterion:

$$ -\theta_s \le \mathrm{Skew}(C) \le \theta_s \quad \text{and} \quad \mathrm{Kurt}(C) \ge \theta_k, $$

where $\theta_s$ and $\theta_k$ are two thresholds to be set by the user. In this paper, $\theta_s$ and $\theta_k$ are set to 1 and 0, respectively, based on experience gained from extensive experiments.
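As an illustration of Definition 1, the following minimal Python sketch computes the skewness and kurtosis of a list of pairwise similarity scores using the formulas above and applies the thresholds θs = 1 and θk = 0. The function names and the `sim` callback are hypothetical and are not taken from the authors' implementation; the sketch also assumes at least four scores with non-zero variance.

```python
from itertools import combinations
from math import sqrt


def pairwise_scores(members, sim):
    """Return the |C|(|C|-1)/2 pairwise similarity scores of a cluster."""
    return [sim(a, b) for a, b in combinations(members, 2)]


def _mean_and_std(xs):
    """Sample mean and sample standard deviation (divisor n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    std = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, std


def skewness(xs):
    """Sample skewness as defined in Section 2.2 (needs len(xs) >= 3)."""
    n = len(xs)
    mean, std = _mean_and_std(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / std) ** 3 for x in xs)


def kurtosis(xs):
    """Sample kurtosis as defined in Section 2.2 (needs len(xs) >= 4)."""
    n = len(xs)
    mean, std = _mean_and_std(xs)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    tail = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(((x - mean) / std) ** 4 for x in xs) - tail


def is_homogeneous(xs, theta_s=1.0, theta_k=0.0):
    """Definition 1: -theta_s <= Skew <= theta_s and Kurt >= theta_k."""
    return -theta_s <= skewness(xs) <= theta_s and kurtosis(xs) >= theta_k
```

With these definitions, a flat set of scores such as 1, 2, 3, 4, 5 has zero skewness but negative kurtosis and is therefore rejected, whereas a symmetric, sharply peaked set of scores passes the test.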
The criterion for homogeneous clusters presented above is applied only to clusters containing more than Min_Cardinality proteins. For clusters that contain fewer than Min_Cardinality proteins, a different criterion is applied, because a cluster must contain a sufficient number of proteins for the tests of skewness and kurtosis to be meaningful. The criterion applied to small clusters is that all the pairwise similarities between two proteins in a small cluster must be larger than a threshold denoted by Min_Similarity. Min_Similarity is imposed to guarantee that every homogeneous cluster containing fewer than Min_Cardinality proteins meets a certain quality criterion. By default, each leaf node in the dendrogram is regarded as a homogeneous cluster containing one single protein. For those clusters that continue to grow in size, the statistical tests that involve the skewness and kurtosis of the cluster will eventually be imposed.
Definition 2 (Leaf homogeneous cluster). A cluster C in a dendrogram is said to be a leaf homogeneous cluster if C and all of its children are homogeneous clusters and the parent of C does not qualify as a leaf homogeneous cluster.
In the summarization process, one protein in a homogeneous cluster will be designated as the representative of the cluster. The representative of a homogeneous cluster C is the protein with the maximum lumped sum of the similarity scores to the other proteins in C. In this paper, the representative of protein cluster C is denoted by Rep(C).
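The selection of Rep(C) can be sketched in a few lines of Python. This is only an illustration: the `sim` callback and the toy scores are hypothetical, and ties are broken arbitrarily by taking the first maximal member, a detail the text does not specify.

```python
def representative(members, sim):
    """Return the member with the largest summed similarity to the others."""
    return max(
        members,
        key=lambda p: sum(sim(p, q) for q in members if q != p),
    )


# Toy example with made-up similarity scores (not data from the paper).
scores = {
    frozenset(("a", "b")): 90,
    frozenset(("a", "c")): 50,
    frozenset(("b", "c")): 80,
}


def toy_sim(x, y):
    """Symmetric lookup of the made-up scores above."""
    return scores[frozenset((x, y))]


rep = representative(["a", "b", "c"], toy_sim)  # summed scores: a=140, b=170, c=130
```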
With homogeneous clusters and their representatives identified, the summarization process then conducts a bottom-up flattening operation on the dendrogram generated by the agglomerative hierarchical clustering algorithm. The bottom-up flattening operation begins with the leaf homogeneous clusters. All the subclusters under a leaf homogeneous cluster will be removed and all the proteins contained in the leaf homogeneous cluster will become its children. Figure 2 illustrates the bottom-up flattening operation. Figure 2a shows the initial dendrogram constructed with 100 objects. In Figure 2a, those clusters marked by an asterisk are leaf homogeneous clusters according to Definition 2. Figure 2b shows the dendrogram after the bottom-up flattening operation has been applied to the leaf homogeneous clusters.
The bottom-up flattening operation described above is conducted recursively, with each of the leaf homogeneous clusters identified in one level of recursion being substituted by its representative protein. Figure 2c depicts the dendrogram derived from substituting the leaf homogeneous clusters in Figure 2b with their respective representatives, and Figure 2d shows the flattened dendrogram after the second level of recursion is conducted. Figure 2e shows the final dendrogram after the bottom-up flattening operation is completed; this dendrogram is referred to as the abstract dendrogram.
Fig. 2. An example illustrating the flattening operation of the summarization process. (a) A dendrogram with nodes passing the criterion of leaf homogeneous cluster marked. (b) The summarized dendrogram after flattening the leaf homogeneous clusters. (c) The dendrogram derived from that in (b) by substituting each leaf homogeneous cluster with its representative. (d) The dendrogram generated after the second recursion of the bottom-up flattening operation is applied. (e) The final dendrogram.
2.3 The incremental process
With the flattened dendrogram, the incremental phase of the clustering algorithm is then carried out to cluster the remaining proteins in the database, i.e. those proteins that were not taken to construct the initial dendrogram, as well as the proteins that may be added into the database later on. These proteins are examined one by one. For each protein, the incremental clustering algorithm examines whether the protein can be inserted into a leaf homogeneous cluster. The criterion for inserting a protein p into a leaf homogeneous cluster C is as follows:
$$ \frac{\mu_C - \mathrm{sim}[p, \mathrm{Rep}(C)]}{\sigma_C} \le \theta_q \quad \text{and} \quad \mathrm{sim}[p, \mathrm{Rep}(C)] \ge \mathrm{sim}[p, \mathrm{Rep}(C')] \;\; \forall\, C', $$

where
(1) $\mu_C$ and $\sigma_C$ are the mean and standard deviation of the pairwise similarities in C;
(2) C is a leaf homogeneous cluster of size larger than 2;
(3) $\mathrm{sim}[p, \mathrm{Rep}(C)]$ denotes the similarity between protein p and protein Rep(C);
(4) $\theta_q$ is a parameter and is set to 1 in this paper for carrying out protein sequence clustering.
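The insertion test can be sketched as follows. This is one possible reading of the criterion, not the authors' implementation: it assumes that C′ ranges over all current leaf homogeneous clusters, that the candidate C is the cluster whose representative is most similar to p, and that σC is non-zero. The `LeafCluster` record and the `sim` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class LeafCluster:
    rep: str      # representative protein, Rep(C)
    mean: float   # mu_C, mean of the pairwise similarities in C
    std: float    # sigma_C, assumed to be non-zero here
    size: int     # number of proteins in C


def choose_cluster(p, clusters, sim, theta_q=1.0):
    """Return the leaf cluster that accepts protein p, or None."""
    if not clusters:
        return None
    # Second condition: Rep(C) must be at least as similar to p as any Rep(C').
    best = max(clusters, key=lambda c: sim(p, c.rep))
    # Item (2): only clusters of size larger than 2 are eligible.
    if best.size <= 2:
        return None
    # First condition: p may not fall too far below the cluster's mean similarity.
    z = (best.mean - sim(p, best.rep)) / best.std
    return best if z <= theta_q else None
```

A return value of None corresponds to the case in which the protein cannot be inserted into any existing leaf homogeneous cluster.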
Each time a protein is inserted into a leaf homogeneous cluster, the skewness and kurtosis of the cluster will be recomputed to check whether the cluster still meets the criterion of being homogeneous. Should the cluster, with the protein inserted, fail to pass the test, a split operation will be conducted. The single-link algorithm and the flattening operation described in Section 2.2 will be invoked to construct a sub-dendrogram containing all the proteins in the cluster and the newly added protein. The split operation and the reconstruction operation described in the next paragraph are essential for avoiding order dependence, which is a common problem in many incremental clustering algorithms.
If the protein being examined cannot be inserted into any of the existing leaf homogeneous clusters, the protein is moved to a temporary buffer called TempBuffer and will be processed again later on. Every time TempBuffer becomes full, a reconstruction operation is conducted to generate a new abstract dendrogram containing all the proteins in the current dendrogram and in TempBuffer. In the reconstruction operation, the primitive objects are the representatives of the leaf homogeneous clusters and the proteins in TempBuffer. In this paper, the single-link algorithm is invoked. During the reconstruction process, two leaf homogeneous clusters in the original dendrogram may merge due to inclusion of the proteins in TempBuffer. As mentioned earlier, the merge operation, along with the split operation described above, is essential for avoiding order dependence, and alternative forms of these two operations have been widely employed in the design of incremental clustering algorithms (Fisher, 1987; Zhang et al., 1996).
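The control flow of the incremental phase can be summarized with the following Python skeleton. It is a sketch under stated assumptions rather than the authors' code: `try_insert` and `rebuild` are hypothetical callbacks standing in for the insertion test (including any split) and for the reconstruction operation, respectively, and `buffer_size` plays the role of b_s.

```python
def incremental_pass(proteins, try_insert, rebuild, buffer_size):
    """Process proteins one by one, buffering those that fit no cluster.

    try_insert(p) -> bool : True if p was inserted into a leaf cluster.
    rebuild(buffered)     : reconstructs the abstract dendrogram from the
                            cluster representatives plus the buffered proteins.
    Returns the proteins still buffered and the number of reconstructions.
    """
    temp_buffer = []
    rebuilds = 0
    for p in proteins:
        if try_insert(p):
            continue
        temp_buffer.append(p)
        if len(temp_buffer) >= buffer_size:  # TempBuffer is full
            rebuild(temp_buffer)
            temp_buffer = []
            rebuilds += 1
    return temp_buffer, rebuilds
```

Because a reconstruction is triggered only once per full buffer, at most k/b_s reconstructions occur when k proteins are processed.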
2.4 Reducing the skewness of the dendrogram
In this paper, an optional top-down flattening operation is developed to reduce the skewness of the dendrogram, i.e. to make the dendrogram more balanced. The optional top-down operation creates a superroot that includes the nodes at the highest levels of the dendrogram. Figure 3a shows a dendrogram generated by the proposed incremental clustering algorithm after the bottom-up flattening operation has been applied. Figure 3b depicts the distribution of the heights of the homogeneous clusters in the dendrogram. The height of a cluster is defined to be the number of nodes on the longest path from the cluster to a leaf homogeneous cluster. In Figure 3b, all the vertical bars with horizontal coordinate larger than or equal to 3 have values equal to 1. This implies that all nodes with height larger than 3 have skewed child-dendrograms. In this case, the top-down flattening operation is invoked to create a superroot that contains all the nodes with height larger than 3, and the result is shown in Figure 3c. In fact, for each dendrogram, there exists a value $\theta_h$ such that the number of nodes in the dendrogram with height h is equal to 1 if $h \ge \theta_h$. The top-down flattening operation simply creates a superroot that contains all the nodes with height larger than $\theta_h$.
Fig. 3. An example that illustrates the top-down flattening operation to reduce the skewness of the dendrogram. (a) A dendrogram generated by the proposed incremental clustering algorithm after the bottom-up flattening operation has been applied. (b) Distribution of the heights of the clusters in the dendrogram shown in (a); horizontal axis: height (0 to 13), vertical axis: number of homogeneous clusters, with bar values 93, 14 and 2 for heights 0, 1 and 2, and 1 for each height from 3 to 13. (c) The final dendrogram after the top-down flattening operation has been applied.
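The choice of θh can be illustrated with the height distribution read off Figure 3b. The sketch below takes θh to be the smallest height from which every bar equals 1; this is an assumption, since the text only states that such a value exists, and the function name is hypothetical.

```python
def superroot_threshold(counts):
    """Smallest h such that counts[h2] == 1 for every h2 >= h.

    counts[h] is the number of homogeneous clusters of height h.
    Returns len(counts) if even the top bar differs from 1.
    """
    theta = len(counts)
    for h in range(len(counts) - 1, -1, -1):
        if counts[h] != 1:
            break
        theta = h
    return theta


# Bar values for heights 0..13, as shown in Figure 3b.
fig3b_counts = [93, 14, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
theta_h = superroot_threshold(fig3b_counts)

# Nodes with height larger than theta_h are absorbed into the superroot.
absorbed = sum(fig3b_counts[theta_h + 1:])
```

For this example the threshold is 3, matching the text, and the ten nodes of heights 4 to 13 are collected into the superroot.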
In our implementation of the incremental phase, a heuristic mechanism that exploits the nature of the single-link algorithm is employed to accelerate the reconstruction operation described above. The detailed description of the accelerated reconstruction process can be found in (Chen, 2003, http://mars.csie.ntu.edu.tw/~cychen/PhDThesisChen2003.pdf).
2.5 Analysis of time complexity
The complete analysis of the time and space complexities of the proposed incremental clustering algorithm can be found in (Chen, 2003, http://mars.csie.ntu.edu.tw/~cychen/PhDThesisChen2003.pdf). In summary, given that the time complexity of the single-link algorithm is $O(n^2 \log n)$, the time complexity of the proposed algorithm is $O(km) + O(kq^2 \log q) + O((k/b\_s)\, m^2 \log m)$, where
(1) n: the number of the proteins in the current version of the protein database;
(2) k: the number of new proteins to be added to the protein database;
(3) m: the number of leaf homogeneous clusters in the current version of the abstract dendrogram;
(4) q: the number of proteins that the largest leaf homo-geneous cluster contains;
(5) b_s: the size of the TempBuffer.
As the experiments reported in the next section reveal, for cluster analysis of contemporary protein databases, we generally have $q \ll m$. Therefore, the dominant term of the time complexity of protein sequence clustering with the proposed incremental clustering algorithm is $O(km) + O((k/b\_s)\, m^2 \log m)$, or $O(k m^2 \log m)$ if b_s is regarded as a constant.
The analysis presented above shows that the time complexity of the proposed incremental clustering algorithm is determined by how m increases as new proteins continue to be added into the protein database. As m can in no case exceed n, the upper bound of the time complexity is $O(k n^2 \log n)$. On the other hand, if m does not increase as new proteins continue to be added into the protein database beyond a certain point, then m can be treated as a constant and the time complexity is $O(k)$. In the experiments reported in this paper, it is observed that, if $n \gg k$, then in general the time complexity is $O(k n^{2\beta} \log n)$, or $O(n^{2\beta} \log n)$ if k is treated as a constant, where $0 < \beta \le 1$. In particular, in the experiment conducted to cluster all the proteins in Swiss-Prot, it is observed that $\beta = 0.865$. This paper also reports the results from several additional experiments conducted to study the correlation between β and the characteristics of the datasets. The general observation is that if the dataset contains a large number of highly similar pairs between the k new proteins and the n proteins that the database originally contains, then β tends to be smaller. Otherwise, β tends to be larger. In the additional experiments, β ranges from 0.830 to 0.979. In fact, 0 and 1 are the theoretical lower bound and upper bound of β, respectively, as the time complexity of generating a new dendrogram with k new proteins added into a protein database containing n proteins is bounded between $O(k)$ and $O(k n^2 \log n)$.
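The step from the bound in m to the bound in n can be spelled out as follows, assuming the empirical power-law relation $m = c\,n^{\beta}$ reported in Section 3.2, with c a positive constant:

$$ k\, m^2 \log m = k\, c^2 n^{2\beta} \left( \log c + \beta \log n \right) = O\!\left(k\, n^{2\beta} \log n\right). $$

With the value $\beta = 0.865$ observed for Swiss-Prot, the exponent is $2\beta = 1.73$, so the cost grows as $O(k\, n^{1.73} \log n)$ rather than as the worst-case $O(k\, n^{2} \log n)$.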
Another issue that deserves further analysis is the quantity of pairwise protein–protein similarity scores that must be computed. The total number of pairwise similarity scores that must be computed for including one new protein into the database could be as high as $m + q$. As mentioned earlier, we typically have $q \ll m$ and thus, for adding k new proteins into the database, the total number of pairwise similarity scores that must be computed is in the order of $O(km)$.

Table 1. Parameter settings employed in the experiments for the proposed incremental clustering algorithm

| Parameter | Value |
|---|---|
| θs | 1 |
| θk | 0 |
| θq | 1 |
| Size of TempBuffer (b_s) | 500 or 5000 |
| Min_Cardinality | 10 for leaf homogeneous clusters; 5 for non-leaf homogeneous clusters |
| Min_Similarity | 90 (bit-score) for leaf homogeneous clusters; no constraint for non-leaf homogeneous clusters |

Table 2. Characteristics of the four datasets used in the experiments

| Dataset | Number of proteins | Number of cross-referenced InterPro families | Number of proteins labeled with family identification |
|---|---|---|---|
| Mouse | 4708 | 861 | 2563 |
| Human | 7471 | 1067 | 3796 |
| Rat | 2916 | 714 | 1902 |
| SP-41 | 122 564 | 4212 | 82 194 |
3 EXPERIMENTAL RESULTS
This section reports the experiments conducted to evaluate the performance of the proposed incremental clustering algorithm. Section 3.1 addresses the clustering and summarization qualities delivered by the proposed incremental clustering algorithm. Section 3.2 analyzes the execution time of the proposed incremental clustering algorithm. Table 1 shows the parameter settings employed in these experiments for the proposed incremental clustering algorithm and Table 2 summarizes the characteristics of the four datasets used in this study. Datasets Mouse, Human and Rat contain the proteins belonging to mouse, human and rat, respectively, in Swiss-Prot (Release 40.0, October 2001). Dataset SP-41 contains all the proteins in Release 41.0 (2003) of Swiss-Prot. In the experiments, parameter b_s in Table 1 is set to 500 for the three smaller datasets Mouse, Human and Rat, and is set to 5000 for dataset SP-41 due to its size. In addition, as we have attempted to emulate the environment in which new protein sequences are continuously added into the database, the same numbers of protein sequences are taken from the beginning of the input datasets for construction of the initial dendrograms. In this paper, the bit-scores computed by the BLAST algorithm (Altschul et al., 1990, 1997) with the BLOSUM62 table are employed.
3.1 Evaluation of clustering and summarization qualities
This section reports the experiments conducted to evaluate the clustering and summarization qualities of the incremental clustering algorithm proposed in this paper. With respect to protein sequence clustering, a popular measure of clustering quality quantifies how well the clusters identified by the clustering algorithm match the protein families identified by biochemists (Kawaji et al., 2001; Yona et al., 1999). Let S denote the set of clusters output by the clustering algorithm. The matching rate of S with respect to a protein family F in InterPro (Apweiler et al., 2000) is defined as follows:
$$ M(F, S) = \max_{C_i \in S} \frac{|C_i \cap (F \cap D)|}{|C_i \cup (F \cap D)|}, \qquad (1) $$

where $C_i$ is a cluster in S, D is the set of proteins on which clustering is conducted and $|D|$ denotes the number of proteins in D. Accordingly, the weighted average matching rate of S is defined as follows:

$$ \bar{M}(S) = \frac{1}{\sum_{F_i} |F_i \cap D|} \sum_{F_i} |F_i \cap D| \cdot M(F_i, S), \qquad (2) $$

where $F_1, F_2, F_3, \dots, F_i$ are protein families in InterPro. In this paper, evaluation of clustering quality is carried out by comparing the dendrograms generated by the proposed incremental clustering algorithm with those generated by the single-link algorithm.
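Equations (1) and (2) translate directly into set operations. The following Python sketch is illustrative only: clusters, families and the dataset D are represented as Python sets of protein identifiers, the function names are hypothetical, and the sketch assumes a non-empty list of non-empty clusters and at least one family that intersects D.

```python
def matching_rate(family, clusters, dataset):
    """Equation (1): best Jaccard overlap between F ∩ D and any cluster."""
    target = family & dataset
    return max(len(c & target) / len(c | target) for c in clusters)


def weighted_average_matching_rate(families, clusters, dataset):
    """Equation (2): matching rates weighted by |F_i ∩ D|."""
    weights = [len(f & dataset) for f in families]
    total = sum(weights)
    weighted = sum(
        w * matching_rate(f, clusters, dataset)
        for w, f in zip(weights, families)
    )
    return weighted / total


# Toy data (not from the paper): six proteins, three clusters, two families.
D = {1, 2, 3, 4, 5, 6}
S = [{1, 2, 3}, {4, 5}, {6}]
F1 = {1, 2, 3, 4}   # best match {1, 2, 3}: 3/4
F2 = {5, 6, 7}      # protein 7 lies outside D; best match {6}: 1/2
average = weighted_average_matching_rate([F1, F2], S, D)  # (4*0.75 + 2*0.5) / 6
```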
Figure 4 shows the weighted average matching rates delivered by these two algorithms with the four benchmark datasets. In Figure 4, no cutoff threshold is imposed to flatten the dendrograms generated by the single-link algorithm. If a cutoff were imposed, the weighted average matching rates delivered by the single-link algorithm would be lower. As Figure 4 reveals, the proposed incremental clustering algorithm and the single-link algorithm deliver comparable performance in terms of weighted average matching rate. In this experiment, the results of the proposed incremental clustering algorithm with the three smaller datasets, Mouse, Human and Rat, are the averages of five independent runs with random order of input sequences.
With respect to summarization quality, Table 3 lists the numbers of non-leaf nodes in the dendrograms generated by the proposed algorithm and by the single-link algorithm with the four benchmark datasets. Again, for the three smaller datasets, the results of the proposed incremental clustering algorithm are the averages of five independent runs with random order of input sequences. Table 4 lists the average depth that a user needs to traverse in each dendrogram in order to find the cluster that best matches a family in InterPro. Here, a cluster $C_i$ in a dendrogram H is said to best match a family F defined in InterPro if

$$ \frac{|C_i \cap (F \cap D)|}{|C_i \cup (F \cap D)|} = \max_{C_j \in H} \frac{|C_j \cap (F \cap D)|}{|C_j \cup (F \cap D)|}, $$

where $C_j$ is a cluster in dendrogram H, and D denotes the set of proteins that H contains. Accordingly, given a dendrogram H, the numbers listed in Table 4 are computed as follows:

$$ \frac{1}{f} \sum_{j=1}^{f} \mathrm{Depth}[(F_j, H)], $$

where
(1) $(F_j, H)$ denotes the cluster in H that best matches $F_j$;
(2) the depth of cluster $(F_j, H)$ in a dendrogram H is the number of edges on the path from $(F_j, H)$ to the root of the dendrogram;
(3) $F_1, F_2, \dots, F_f$ are the families defined in InterPro that contain proteins in H.

Fig. 4. Comparison of the weighted average matching rates delivered by the proposed incremental clustering algorithm and the single-link algorithm. (a) Comparison of the weighted average matching rate for protein families containing more than 10 proteins. (b) Comparison of the weighted average matching rate for protein families containing more than 30 proteins. (Both panels plot the single-link algorithm and the proposed algorithm for the datasets Mouse, Human, Rat and SP-41, with percentages on the vertical axis.)
As shown in Table 3, the dendrograms generated by the proposed incremental clustering algorithm contain far fewer non-leaf nodes than the corresponding dendrograms generated by the single-link algorithm. Furthermore, as Table 4 shows, the clusters in the dendrograms generated by the single-link algorithm that best match the protein families in InterPro are deeply embedded in the dendrogram. On the other hand, in the dendrograms generated by the proposed incremental clustering algorithm, most of the clusters that best match the protein families in InterPro are located no more than five levels down from the root. What Tables 3 and 4 combined imply is that, owing to the summarization mechanism incorporated, the user can find protein clusters with biological meaning much more easily in the dendrograms generated by the proposed incremental clustering algorithm than in the dendrograms generated by the single-link algorithm. That is, the proposed incremental clustering algorithm offers users highly concise dendrograms for analysis of protein clusters with biological significance.

Table 3. Comparison of the numbers of non-leaf nodes in the dendrograms generated by the proposed algorithm and the single-link algorithm

| Dataset | Non-leaf nodes, single-link | Non-leaf nodes, the proposed algorithm |
|---|---|---|
| Mouse | 4707 | 1333.8 ± 6.14 |
| Human | 7470 | 1986.8 ± 12.28 |
| Rat | 2915 | 806 ± 6.54 |
| SP-41 | 122 563 | 25 479 |

Table 4. Comparison of the average depths that a user needs to traverse in the dendrograms in order to find the cluster that best matches a family in InterPro

| Dataset | Depth, single-link | Depth, the proposed algorithm |
|---|---|---|
| Mouse | 798.33 | 1.85 ± 0.037 |
| Human | 1414.24 | 2.54 ± 0.050 |
| Rat | 520.67 | 1.78 ± 0.029 |
| SP-41 | 15312.95 | 4.48 |
In the discussion above, we have reported the overall clustering and summarization quality of the incremental clustering algorithm proposed in this paper. In the following, we present more in-depth analyses. One issue that we have examined is the purity of the homogeneous clusters, defined as follows:

$$ \mathrm{purity}(C) = \max_{i = 1, \dots, f} \frac{|F_i \cap C|}{\left| \bigcup_{k = 1, \dots, f} (F_k \cap C) \right|}, $$

where $F_1, F_2, \dots, F_f$ are the families defined in InterPro that contain proteins in cluster C.
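The purity measure can be sketched in Python with sets of protein identifiers. The function name and the toy data are hypothetical. Note that the denominator counts only the proteins of C that carry some family annotation, so unannotated proteins do not lower the purity.

```python
def purity(cluster, families):
    """Largest single-family share among the annotated proteins of a cluster."""
    overlaps = [f & cluster for f in families]
    annotated = set().union(*overlaps)
    if not annotated:
        raise ValueError("cluster contains no protein with a family annotation")
    return max(len(o) for o in overlaps) / len(annotated)


# Toy example (not data from the paper): protein 5 has no family annotation.
C = {1, 2, 3, 4, 5}
toy_families = [{1, 2, 3}, {4}, {9}]
p = purity(C, toy_families)  # 3 of the 4 annotated proteins share one family
```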
The purity of a homogeneous cluster can be regarded as an inverse index of how likely proteins from different families are to be mixed in a homogeneous cluster. In the dendrogram generated by the proposed incremental clustering algorithm with the SP-41 dataset, there are 9017 homogeneous clusters with two or more proteins and with family identification in InterPro. Figure 5 shows a histogram of these 9017 homogeneous clusters. Among these 9017 homogeneous clusters, there are 8653 clusters with purity equal to 1, and the weighted average purity of these 9017 homogeneous clusters is 98.50%.

Fig. 5. The statistics for the homogeneous clusters with two or more proteins and with family identification. (Horizontal axis: number of proteins in a leaf homogeneous cluster; vertical axis, logarithmic: number of homogeneous clusters. Bar values: 2~10: 7190; 11~20: 1061; 21~30: 397; 31~40: 158; 41~50: 73; 51~60: 54; 61~70: 33; 71~80: 18; 81~90: 9; 91~100: 6; 101~150: 14; 151~200: 3; 201~300: 1.)

Another issue that we have examined is how consistent the proposed incremental clustering algorithm is. We have compared how the human proteins in the Human dataset are clustered in the dendrograms generated with the Human dataset and with the SP-41 dataset. The analysis is based on the modified matching rate defined in the following:

$$ M(F, S) = \max_{C_i \in S} \frac{|P \cap C_i \cap (F \cap D)|}{|P \cap [C_i \cup (F \cap D)]|}, $$

where $C_i$ is a cluster in dendrogram S, D is the set of proteins on which clustering is conducted and P is the Human dataset. Among all the families that contain proteins in the Human dataset, 71.08% have exactly identical modified matching rates with the two dendrograms, the one generated with the SP-41 dataset and the one generated with the Human dataset. In addition, 20.5% have a higher matching rate with the dendrogram generated with the SP-41 dataset, and 8.42% have a higher matching rate with the dendrogram generated with the Human dataset. Overall, the experimental results reveal that the proposed incremental clustering algorithm performs quite consistently with datasets of different sizes and distributions.
Figure 6 presents a subtree of the dendrogram generated with the Human dataset to demonstrate the effects achieved with the proposed incremental clustering algorithm. In this example, the Glutathione S-transferase proteins with an identical subunit are clustered in the same leaf homogeneous cluster and each of these homogeneous clusters corresponds to a family defined in InterPro. The only exception is the leaf homogeneous cluster that contains Glutathione S-transferase theta 1 and Glutathione S-transferase theta 2, which do not belong to any family in InterPro. In this subdendrogram, all proteins except the four in family IPR002946 contain domains ‘Glutathione S-transferase, C-terminal’ and ‘Glutathione S-transferase, N-terminal’. The four proteins in family IPR002946 are present in the subdendrogram because they contain domain ‘Glutathione S-transferase, C-terminal’.
3.2 Evaluation of execution time
As elaborated in Section 2.5, the time complexity of the proposed incremental clustering algorithm for generating a new dendrogram with k new proteins added into a protein database is $O(k m^2 \log m)$, where m is the number of leaf homogeneous clusters in the dendrogram corresponding to the current version of the database. Therefore, it is important to analyze how m grows as the number of proteins in the database, denoted by n in the following discussion, increases. Figure 7 shows the results from running the proposed incremental clustering algorithm to cluster all the proteins in Swiss-Prot (Release 41.0, 2003, 122 564 proteins). It is observed that the relation between log m and log n is governed by a linear equation with slope equal to 0.865. In other words, we have $m = c\,n^{\beta}$ with $\beta = 0.865$. We also conducted four additional experiments to gain more insight into the relation between β and the characteristics of the dataset. The general observation is that if the dataset contains a large number of highly similar protein sequences, then β tends to be smaller. Otherwise, β tends to be larger. In the experiments conducted to cluster the human, rat and mouse proteins, the observed β values are 0.948, 0.979 and 0.888, respectively, when n is sufficiently large. On the other hand, if clustering is conducted on a dataset that contains proteins from two species, then the observed β value is smaller, 0.830 in this case. The reason behind this observation is that there exists a large number of highly similar protein sequences in the human, mouse and rat datasets. As mentioned earlier, m can in no case exceed n. Therefore, the upper bound of β is 1.
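The slope of the log m versus log n relation can be estimated by ordinary least squares, as in the Python sketch below. This illustrates the general technique only; the paper does not state how the line was fitted. The function name is hypothetical and the data points are synthetic, generated from an exact power law rather than taken from the experiments.

```python
from math import exp, log


def fit_power_law(ns, ms):
    """Fit m = c * n**beta by least squares on (log n, log m); return (beta, c)."""
    xs = [log(n) for n in ns]
    ys = [log(m) for m in ms]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                    # slope of the fitted line
    c = exp(y_mean - beta * x_mean)     # intercept mapped back from log space
    return beta, c


# Synthetic points generated from m = 2 * n**0.865 (illustrative, not measured).
ns = [1000, 5000, 20000, 100000]
ms = [2.0 * n ** 0.865 for n in ns]
beta, c = fit_power_law(ns, ms)
```

On such noise-free input the fit recovers the generating exponent and constant up to rounding error; with measured values of m, the slope is only an estimate of β.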
Figure 8 shows the actual execution time of the proposed incremental clustering algorithm in one benchmark case. In this experiment, 7000 human proteins in Swiss-Prot are incrementally added into the dataset that contains 4708 mouse proteins in Swiss-Prot. Each time, 500 proteins are added and the proposed algorithm resorts to an abstraction generated by the previous run of the algorithm to carry out clustering incrementally. In fact, the execution time of the proposed incremental clustering algorithm is dependent on the implementation of the single-link algorithm. In other words, if the single-link algorithm runs faster owing to a more advanced implementation, the proposed incremental algorithm also benefits. The execution time reported in Figure 8a does not include the time taken to compute the pairwise similarity scores among proteins. Figure 8b compares the number of pairwise similarity scores that need to be computed with the proposed incremental clustering algorithm in this experiment with the number that need to be computed if clustering is carried out without the summarization process. It is observed that the proposed incremental clustering algorithm reduces the number of sequence alignment operations that need to be carried out by about 70%.