Study of Mining Protein Structural Properties and its Application

A Dissertation Proposal Presented to the

Department of Computer Science and Information Engineering

College of Electrical Engineering and Computer Science

National Taiwan University

In Partial Fulfillment of the Requirements for the Degree

Doctor of Philosophy

by Yu-Feng Huang

Dr. Chien-Kang Huang, Dissertation Supervisor

Dr. Yen-Jen Oyang, Dissertation Supervisor

December 11, 2007

List of Tables

Table 1. A rough guide to the resolution of protein structure

Table 2. List of protein chains for 6 randomly selected EC families

Table 3. Experimental results for local conservation discovery via pair-wise protein structure comparison

Table 4. Description of assessment

Table 5. Experimental results for enzyme classification prediction

Table 6. Multiple EC label prediction

Table 7. Statistical result of SSBOND and Cysteine pair

List of Figures

Figure 1. Yearly growth of released structures in Protein Data Bank

Figure 2. 20 standard amino acids

Figure 3. Venn diagram grouping amino acids according to their properties

Figure 4. Position specific score matrix (PSSM) generated by PSI-BLAST

Figure 5. The hierarchy of CATH

Figure 6. The overall framework for mining conserved local structure

Figure 7. Neighborhood residues sphere

Figure 8. The flow chart for mining conserved structural patterns via pair-wise protein structure comparison

Figure 9. The flow chart for mining conserved structural patterns via NRS-based conservation mining approach

Figure 10. Protein PDB ID 1J9Z:A and its binding substrates

Figure 11. PDB ID 1SMI:A and the substrate is HEM

Figure 12. Neighborhood Residues Sphere

Figure 13. Flow chart of mining conservation patterns

Figure 14. Enzyme classification prediction

Figure 15. Conserved patterns of EC 3.2.1.17

Figure 16. Conserved local structure and a ligand

Figure 17. Conserved pattern and ligand, SDK, of protein PDBID 1AU0

Figure 18. Residue environmental sphere

Figure 19. Database table schema for structural property exploration

Figure 20. Distribution between distance and its frequent

Figure 21. Disulfide bond and ligand

Figure 22. Comparison of latest version and previous version of 1UMR

Figure 23. Encoding scheme for transforming structure information into binary signature

Figure 24. Residue-residue contacts

Figure 25. Protein-ligand contact

Figure 26. Protein-protein interaction region

Figure 27. Protein-RNA interaction region

Figure 28. Protein-DNA interaction region

Figure 29. Intermolecular disulfide bond

Figure 30. Intramolecular disulfide bond

Abbreviations

1D one-dimensional

3D three-dimensional

ASA accessible surface area

CATH CATH Protein Structure Classification (Class, Architecture, Topology, Homologous Superfamily)

CSA Catalytic Site Atlas

DNA deoxyribonucleic acid

HGP Human Genome Project

NDB Nucleic Acid Database

NMR Nuclear Magnetic Resonance

RMSD root mean square deviation

RNA ribonucleic acid

RSA relative solvent accessibility

PDB Protein Data Bank

PSSM Position Specific Score Matrix

SCOP Structural Classification of Proteins

SH2 src homology 2

wwPDB Worldwide Protein Data Bank

1. INTRODUCTION

1.1. Current Status of Structural Genomics

The Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy and the National Institutes of Health. It began in 1990 and was completed in 2003; researchers from Hong Kong, Japan, France, Germany, China, and elsewhere joined the HGP during that period. The project goals were to identify all of the approximately 20,000-25,000 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues (ELSI) that may arise from the project (adapted from http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml).

With the rapid growth of protein sequences, structures, and other biological data, researchers face very large datasets for analysis. Bioinformatics can be defined as the study of two information flows in molecular biology [1]. The first flow follows the central dogma of molecular biology: DNA sequences are transcribed into mRNA sequences, and mRNA sequences are then translated into protein sequences. The second flow runs from experimental observations to models. In the first flow, we use informatics methods to analyze biological sequence and structure data. In the second flow, we build a model to explain our observations and then use new experiments to test that model.

Beccari prepared the first protein of vegetable origin in 1747 [2], and the Protein Data Bank began to collect experimentally determined three-dimensional structural data in 1976. Over the past three decades, the number of released structures has grown exponentially, as shown in Figure 1. As of January 1, 2008, the Protein Data Bank (PDB) [3] held 48161 structures determined by X-ray crystallography or nuclear magnetic resonance (NMR). They include proteins, protein complexes, nucleic acids, and protein-nucleic acid complexes.

The number of determined protein structures has increased greatly since 1976, and with this fast growth of the Protein Data Bank, protein functional analysis has become more and more important [13]. Research on functional classification has been under way for many years. Based on this previous research, understanding the relationship between protein structure and function requires data mining techniques for massive protein structure analysis.

Figure 1. Yearly growth of released structures in Protein Data Bank.

The release statistics were last updated on December 11, 2007.

Structural bioinformatics is the subdiscipline of bioinformatics that focuses on the representation, storage, retrieval, analysis, and display of structural information at the atomic and subcellular spatial scales [4]. Both protein structure determination and protein structure prediction have been investigated for many years. Issues in structural biology include secondary structure prediction [5-7], protein disorder region prediction [8-10], B-factor prediction [11], binding residue prediction [12, 13], RNA-binding residue prediction [14-17], DNA-binding residue prediction [18-24], and prediction of protein-protein interaction [25-27], protein-RNA interaction [17], or protein-DNA interaction [28]. Furthermore, contact preferences have been investigated in the interaction regions of protein-protein [29], protein-RNA [30], and protein-DNA [31, 32] complexes.

Figure 2. 20 standard amino acids.

This diagram is adapted from http://matcmadison.edu/biotech/resources/proteins/labManual/images/amino_000.gif

1.2. Sequences, Structures, and Functions

1.2.1. Protein Structure

Proteins are linear chains of amino acids linked together by peptide bonds between the carboxyl and amino groups of adjacent residues. The sequence of amino acids is called the primary structure. In nature, there are 20 standard amino acids, although a residue in a protein may be chemically altered by post-translational modification. The 20 standard amino acids in Figure 2 are alanine (Ala, A), arginine (Arg, R), asparagine (Asn, N), aspartic acid (Asp, D), cysteine (Cys, C), glutamic acid (Glu, E), glutamine (Gln, Q), glycine (Gly, G), histidine (His, H), isoleucine (Ile, I), leucine (Leu, L), lysine (Lys, K), methionine (Met, M), phenylalanine (Phe, F), proline (Pro, P), serine (Ser, S), threonine (Thr, T), tryptophan (Trp, W), tyrosine (Tyr, Y), and valine (Val, V). Each amino acid has its own properties, as shown in Figure 3.

In proteins, secondary structure can be assigned by the DSSP software [33] according to the hydrogen bonds between backbone amide groups, and its principal classes are α-helix and β-sheet. The secondary structure of a protein is nonlinear, localized to regions of an amino acid chain, and formed and stabilized by hydrogen bonding. The hydrogen bonding in these elements of structure provides much of the enthalpy of stabilization that allows the polar backbone groups to exist in the hydrophobic core of a folded protein [34]. In biochemistry, the tertiary structure of a protein is its three-dimensional structure as given by its atomic coordinates. In protein structure recognition, however, secondary structure is widely used instead of atomic coordinates to describe the three-dimensional form of local segments of biopolymers. The tertiary structure of a protein is nonlinear and is formed and stabilized by hydrogen bonding, covalent bonding, hydrophobic packing toward the core, and hydrophilic exposure to solvent. The quaternary structure of a protein is the arrangement of its folded chains when the protein consists of more than one polypeptide chain. Protein assemblies composed of more than one polypeptide chain are called oligomers, and the individual chains of which they are made are termed monomers or subunits [34]. The quaternary structure of a protein is nonlinear, global across distinct amino acid polymers, and formed by hydrogen bonding, covalent bonding, hydrophobic packing, and hydrophilic exposure; favorable, functional structures occur frequently and have been categorized.

Figure 3. Venn diagram grouping amino acids according to their properties.

This is one of the classic Venn diagrams of amino acid properties. The picture is adapted from http://condor.ebgm.jussieu.fr/~debrevern/VENN_DIAGRAM/aa_venn_diagram.png.

In a protein structure, residues interact with each other in three-dimensional space via covalent bonds or non-covalent interactions such as electrostatic interactions, hydrogen bonds, or Van der Waals forces. A covalent bond is characterized by the sharing of pairs of electrons between atoms, and it is stronger than most non-covalent interactions. The disulfide bond is a special kind of covalent linkage in protein structure, formed between the Sγ atoms of two cysteine residues during protein folding. Disulfide bond formation is a covalent modification; the oxidation reaction can be either intramolecular (within the same protein) or intermolecular (between different proteins, e.g., antibody light and heavy chains). The reaction is reversible.

Van der Waals interactions contribute strong repulsion at short distances and weak attraction at distances just greater than the sum of the atomic radii. Salt bridges play important roles in protein structure and function, e.g., in oligomerization, molecular recognition, allosteric regulation, domain motions, flexibility, thermostability, and alpha-helix capping. The electrostatic contribution to the free-energy change upon salt-bridge formation varies significantly, from stabilizing to marginal to destabilizing [35]. A hydrogen bond occurs between an electronegative atom and a hydrogen atom bonded to another electronegative atom; it is a special type of dipole-dipole interaction. A typical hydrogen bond is stronger than Van der Waals forces but weaker than covalent, ionic, and metallic bonds.

1.2.2. Sequence, Structure, and Function

With the growth of sequence, structural, and biochemical data, the evolution of protein function can be studied from sequence, structure, or both. Homologous proteins can be identified with alignment tools such as BLAST [36] or FASTA [37], which reveal the relation between proteins. Sequence alignment algorithms measure the similarity between protein sequences, and evolutionary information can be detected from the aligned sequence fragments. With multiple sequence alignment, sequence conservation can also be discovered and linked to protein function. From a structural standpoint, protein function and protein structure are inherently linked [38], and structural template comparison can recognize protein function by comparing templates against protein structures [39]. Neither sequence similarity nor structure similarity alone can directly determine protein function. Each gives only partial information about protein function or evolution [40].

1.3. Tackled Issues in this Dissertation

1.3.1. Study of Local Structure Representation

Najmanovich et al. describe predicting the function of a protein from its three-dimensional structure as a major intellectual and practical challenge [41]. They show that detecting local structural similarity can be applied to predict the function of a protein. Orengo et al. point out that sequence-based methods can fail to detect very distant relationships, and that these can only be recognized from 3D structure, which is much more highly conserved during evolution [42]. Moreover, researchers have devoted increasing effort to the study of protein functional sites and ligand-binding areas [39, 43, 44]. All these findings give an important hint about the relation between protein function and local structure. Hence, can we develop an appropriate representation to describe the connection between a specific local structure and its corresponding function in proteins?

1.3.2. Study of Conserved Structure for Functional Classification

The concept of local region conservation comes from the motif, a fragment with biological or functional meaning, and rests on the common assumption that proteins with the same function share common local regions. In sequence analysis, Campbell et al. [45] applied sequence alignment to discover sequence conservation and then mapped the conserved regions into three-dimensional space, where they lie close to the binding area. In structure analysis, the binding area of a protein-ligand complex is widely used to identify protein function via local structure recognition. CSA (Catalytic Site Atlas) [39] and Protemot [44] use protein-ligand complexes to recognize protein function via local structure similarity. Based on their results, the authors of CSA and Protemot point out that non-homologous proteins may have the same function; in other words, proteins with dissimilar global structures may share a function, which suggests that function resides in local protein structure. Currently, we pursue two directions: one is pair-wise protein structure comparison, and the other is to use the neighborhood residues sphere (NRS), a sphere of radius d (d = 10 by default), to describe local structure. In our experimental results, both approaches discover conserved local structures for most enzyme families, and some of the conserved local structures are close to ligands.
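As a rough illustration of the neighborhood residues sphere, the following Python sketch collects every residue whose representative point lies within radius d of a chosen center residue. It assumes that each residue is reduced to a single coordinate (for example its Cα atom) and that d is expressed in the same unit as the coordinates; the representative atom, the toy coordinates, and the names `Residue` and `neighborhood_sphere` are illustrative assumptions, not part of the method described here.

```python
import math
from typing import List, NamedTuple, Tuple


class Residue(NamedTuple):
    """Hypothetical container: one representative coordinate per residue."""
    index: int
    name: str
    xyz: Tuple[float, float, float]


def neighborhood_sphere(residues: List[Residue], center: Residue,
                        d: float = 10.0) -> List[Residue]:
    """Return all residues other than `center` within distance d of it."""
    return [r for r in residues
            if r.index != center.index and math.dist(r.xyz, center.xyz) <= d]


# Toy coordinates, invented for illustration only.
chain = [
    Residue(1, "CYS", (0.0, 0.0, 0.0)),
    Residue(2, "HIS", (3.8, 0.0, 0.0)),
    Residue(3, "ASP", (0.0, 9.5, 0.0)),
    Residue(4, "GLY", (12.0, 0.0, 0.0)),
]
sphere = neighborhood_sphere(chain, chain[0])
sphere_names = [r.name for r in sphere]
print(sphere_names)
```

The linear scan is enough for a single chain; for a whole structure collection a spatial index would likely be needed, which this sketch does not attempt.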

1.3.3. Mining General Protein Structural Properties

The fast growth in the number of protein structures provides more material for discovering local residue environments, with or without chemical bond information. Residue environments have been studied and applied to protein threading and protein binding site characterization [46]. In a protein structure, the residue is the essential element of conformation, and residue-residue contacts affect the overall framework of the structure. Protein folding is highly correlated with residue contacts formed through covalent bonds, ionic bonds, hydrogen bonds, Van der Waals attractions, or disulfide bonds. For quick searching of residue environments, we use the residue environmental sphere to describe the environment surrounding a residue. To explore protein structural properties, we have to analyze the different residue neighborhoods across the whole protein structure collection. Applying mining techniques to protein structures to discover residue environmental information is an interesting problem, and storing the entire structure and sphere information of a huge protein structure collection in a database is also a great challenge.

1.3.4. Involving the New Approaches of Fast Structure Mining

Because massive pair-wise sequence and structure comparisons are time-consuming, we still have to improve performance for fast structure mining. Brevern et al. defined protein blocks [47] to understand the sequence-structure relationship, and the structural alphabet [48] is an improved representation of protein blocks. With it, a protein structure is encoded as a one-dimensional sequence that can be treated like a protein sequence, so BLAST can easily be applied. They also proposed a substitution matrix for the structural alphabet based on statistical analysis of alphabet mutations. In contrast to the structural alphabet, we propose to encode protein structure via signatures and an indexing technique for fast structure mining.

As in conserved structure mining, we use the neighborhood residues sphere to describe protein local structure, transform each sphere into a bit-string environmental signature, and apply the indexing technique to these signatures to provide protein structure indexing and fast database search.
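The encoding scheme itself is not specified in this section, so the following Python sketch only illustrates the general idea of a bit-string signature used as a fast pre-filter. It assumes a deliberately simple encoding, one bit per standard amino acid type present in a sphere, and a subset test by bitwise AND; the function names and the toy spheres are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}


def sphere_signature(residue_codes: str) -> int:
    """Assumed encoding: set one bit for each amino acid type in the sphere."""
    signature = 0
    for aa in residue_codes:
        signature |= BIT[aa]
    return signature


def may_contain(candidate_sig: int, query_sig: int) -> bool:
    """True if every bit of the query is also set in the candidate.

    This is only a filter: a hit still needs a full structural comparison.
    """
    return (candidate_sig & query_sig) == query_sig


# Toy spheres written as strings of one-letter residue codes.
spheres = {"s1": "CHHD", "s2": "GAVL", "s3": "CHDE"}
index = {sid: sphere_signature(codes) for sid, codes in spheres.items()}

query = sphere_signature("CHD")
hits = [sid for sid, sig in index.items() if may_contain(sig, query)]
print(hits)
```

Comparing integers is much cheaper than comparing coordinates, which is the reason a signature filter can reduce the number of expensive pair-wise comparisons.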

1.3.5. Coordination of Sequence and Structural Conservation

MAGIIC-PRO [49], developed by Hsu et al., detects functional signatures through the analysis of homologous protein sequences; the authors apply sequence pattern mining to discover the functional signatures of a query protein. Their experimental results reveal that sequence conservation correlates with protein function according to ligand information. Building on our previous study of locally conserved structures, we plan to integrate sequence conservation and structure conservation to analyze the relationship among sequences, structures, and functions. Our original idea is to examine the relationship between sequence conservation and structure conservation for each enzyme family. The proteins within an enzyme family have the same function but are derived from different species; therefore, an enzyme family is a good starting point for discovering sequence and structure conservation based on the relationship between sequences, structures, and functions.

1.3.6. Apply Mining Results in Function/Structure/Sequence Prediction and Annotation

Based on the experimental results of the first three sub-topics, we plan to combine the mining results with machine learning techniques to improve prediction accuracy and annotation. Recent research has applied structural properties in prediction from primary sequence to improve accuracy. Computer-aided annotation of protein sequences, structures, and functions has been studied on the basis of global sequence and structure information. Our idea instead starts from local protein sequence and structure and correlates them with function; we therefore include the structural properties of local regions to study the correlation of sequence, structure, and function from a local point of view. In addition, we will include structure information as features for machine learning prediction from primary sequence.

1.4. Overview

The sections of this proposal are organized roughly according to the issues tackled in this dissertation. In the next section, we review previous research related to structure mining and protein function. Section 3 considers the framework for mining conserved local structure and the study of local structure and protein function. Section 4 gives detailed information about each part of the overall framework. Section 5 discusses and summarizes the experimental results. Finally, Section 6 introduces our ongoing work and further study.

2. LITERATURE REVIEWS

2.1. Sequence, Structure, and Function

Sequence similarity is determined by aligning sequences and is commonly measured as percent identity. Homologous sequences, which derive from the same ancestral sequence, can be recognized by identical residues at corresponding positions in the alignment. In general, similar protein sequences imply similar structures and similar functions. Protein function can therefore be inferred from sequence similarity and structure similarity, but there are exceptions. For example, TIM-barrel proteins share a fold of eight β/α motifs arranged into a barrel structure, yet they carry out many different functions [50]. Proteins that differ in sequence and structure may have converged to similar active sites, catalytic mechanisms, and biochemical functions. Proteins with low sequence similarity but very similar overall structure and active sites are likely to be homologous [34].
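To make the notion of percent identity concrete, the following Python sketch computes it for two sequences that have already been aligned. Several conventions exist for the denominator; this sketch assumes the number of aligned columns in which neither sequence has a gap, and the example sequences are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over the ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Invented alignment: 6 ungapped columns, 4 of them identical.
identity = percent_identity("MKT-AYI", "MRTQAYL")
print(round(identity, 1))
```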

2.2. Sequence Motif and Structural Motif

The term motif denotes a characteristic fragment that is biologically significant for protein function. A motif can be a sequence motif, a structural motif, or a functional motif. A sequence motif is a particular amino acid sequence that is characteristic of a specific biochemical function. The zinc finger motif is an example of a sequence motif; it is found in a family of DNA-binding proteins and has the form Cys-X₂₋₄-Cys-X₃-Phe-X₅-Leu-X₂-His-X₃-His (C₂H₂) [51, 52]. Sequence motifs may be evolutionarily conserved and can be discovered by sequence alignment based on evolutionary similarity. Research on sequence conservation has found that discovered sequence motifs correlate with biological functions [53]. A structural motif is a motif in three-dimensional space. Commonly, a structural motif is a set of contiguous secondary structure elements that either have a particular functional significance or define a portion of an independently folded domain [34]. The helix-turn-helix is an example of a structural motif found in DNA-binding proteins.
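A sequence motif of this kind can be searched for with a regular expression. The following Python sketch translates the zinc finger pattern quoted above literally, reading each X as any single uppercase residue letter; the test sequence is synthetic and was constructed only to contain the pattern.

```python
import re

# Literal translation of Cys-X(2-4)-Cys-X(3)-Phe-X(5)-Leu-X(2)-His-X(3)-His.
ZINC_FINGER_C2H2 = re.compile(
    r"C[A-Z]{2,4}C[A-Z]{3}F[A-Z]{5}L[A-Z]{2}H[A-Z]{3}H"
)

# Synthetic sequence (not a real protein): filler residues are alanine.
synthetic = ("MK" + "C" + "AA" + "C" + "AAA" + "F" + "AAAAA"
             + "L" + "AA" + "H" + "AAA" + "H" + "GS")

match = ZINC_FINGER_C2H2.search(synthetic)
if match:
    print(match.start(), match.end(), match.group())
```

A plain regular expression treats every X position as equally unconstrained, so it ignores the position-specific preferences that a profile such as a PSSM can capture.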

2.3. Structural Property

In sequence-based prediction, the position-specific scoring matrix (PSSM), shown in Figure 4, is used to improve prediction accuracy in protein sequence analysis. The PSSM gives the log-odds score for finding a particular amino acid at each position of a target sequence. Prediction tools therefore treat the PSSM as a sequence property of each amino acid. In protein structure prediction, amino acid properties, secondary structure information, B-factor, accessible surface area (ASA), and relative solvent accessibility (RSA) are structural properties. Accordingly, methods that predict protein structure purely from sequence information have tried to encode biochemical properties related to protein structure to improve prediction accuracy. In 1992, Singh and Thornton [54] compiled an atlas of protein side-chain interactions to understand sidechain-sidechain interactions. They examined interactions for all 20 × 20 amino acid pairs and counted the frequency of each pair.

Figure 4. Position specific score matrix (PSSM) generated by PSI-BLAST.
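The log-odds scores stored in a PSSM can be illustrated with a simplified calculation for a single alignment column. The Python sketch below assumes a uniform background frequency of 1/20 and a pseudocount of 1 per amino acid; PSI-BLAST uses its own sequence weighting, pseudocounts, and background frequencies, which are not reproduced here, and the column is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_log_odds(column: str, pseudocount: float = 1.0) -> dict:
    """Simplified log2-odds scores for one alignment column.

    Assumes a uniform background (1/20); real tools use observed
    background frequencies and more elaborate pseudocounts.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }


# Invented column from eight aligned sequences, dominated by leucine.
scores = column_log_odds("LLLLIVLL")
print(round(scores["L"], 2), round(scores["W"], 2))
```

A positive score marks a residue that is seen more often than the background predicts at that position, and a negative score marks one that is seen less often.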

In addition, Glaser et al. [55] studied the structural properties of residues at protein-protein interfaces. To understand protein structure conformation from the inside, exploring protein structural properties such as amino acid interactions or residue-residue contacts is very important. Contact preference is another important topic in structure environment analysis; it describes how residues interact with each other [29-31]. Each residue has a different tendency to contact other residues in the structural environment. Furthermore, residue-residue contacts in a protein-protein interaction region reveal the residue environment when one protein interacts with another. Contact preference between residues and nucleic acid bases is a further topic for structure environment analysis in interaction regions.
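Contact preferences are usually derived from counts of how often each pair of residue types is found in contact. The following Python sketch tallies such pairs for a toy structure, assuming one representative coordinate per residue and a distance cutoff of 6.0 in the unit of the coordinates; the cutoff, the coordinates, and the function name are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def contact_pair_counts(residues, cutoff: float = 6.0) -> Counter:
    """Count residue-type pairs whose representative points are in contact.

    `residues` is a list of (residue_name, (x, y, z)) tuples. Pairs are
    stored in sorted order so that (A, B) and (B, A) are the same key.
    """
    counts = Counter()
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(residues, 2):
        if math.dist(xyz_a, xyz_b) <= cutoff:
            counts[tuple(sorted((name_a, name_b)))] += 1
    return counts


# Toy coordinates, invented for illustration only.
toy = [
    ("CYS", (0.0, 0.0, 0.0)),
    ("CYS", (2.0, 0.0, 0.0)),
    ("LEU", (5.0, 0.0, 0.0)),
    ("ASP", (20.0, 0.0, 0.0)),
]
pair_counts = contact_pair_counts(toy)
print(dict(pair_counts))
```

Turning such raw counts into preferences requires normalizing by how often each residue type occurs, a step this sketch leaves out.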

Table 1. A rough guide to the resolution of protein structure

Resolution (Å)    Meaning

> 4.0    Individual coordinates meaningless. Secondary structure elements can be determined.

3.0 - 4.0    Fold possibly correct, but errors are very likely. Many sidechains placed with wrong rotamer.

2.5 - 3.0    Fold likely correct except that some surface loops might be mismodelled. Several long, thin sidechains (lys, glu, gln, etc) and small sidechains (ser, val, thr, etc) likely to have wrong rotamers.

2.0 - 2.5    As 2.5 - 3.0, but number of sidechains in wrong rotamer is considerably less. Many small errors can normally be detected. Fold normally correct and number of errors in surface loops is small. Water molecules and small ligands become visible.

1.5 - 2.0    Few residues have wrong rotamer. Many small errors can normally be detected. Folds are extremely rarely incorrect, even in surface loops.

< 1.5    In general, structures have almost no errors at this resolution. Individual atoms in a structure can be resolved.

Table is taken from Daniel (2007) and Blow (2002).

2.4. Structural Database

2.4.1. Worldwide Protein Data Bank

The Worldwide Protein Data Bank (wwPDB) [56] consists of organizations that act as deposition, data processing and distribution centers for PDB data. The founding members are RCSB PDB (USA) [3], MSD-EBI (Europe) and PDBj (Japan). Beccari discovered the first protein of vegetable origin in 1747 [2], and the Protein Data Bank (PDB) began to collect three-dimensional structure data in 1976. As of December 4, 2007, the PDB contained 47625 protein structures. It is a worldwide repository for three-dimensional structure data of proteins, protein complexes, nucleic acids, and protein-nucleic acid complexes. Typically, these data are determined by X-ray crystallography, NMR spectroscopy, or electron microscopy. Most structures are determined by X-ray crystallography, followed by NMR spectroscopy. Table 1 gives a rough guide to the resolution of protein structures, which helps in deciding how to use the structural data. The material in this table is taken from Blow [57] and Minor [58].

2.4.2. Enzyme Data Bank

The Enzyme Data Bank [59] is a collection of information on all known enzymatic reactions defined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). The EC (Enzyme Commission) number is assigned by the IUBMB. It consists of four numerals, such as 1.6.2.4, written in a dotted form similar to an Internet Protocol address, and it represents the hierarchical classification of enzymes according to the type of chemical reaction they catalyze. In the Enzyme Data Bank, the entry for an EC number consists of the recommended name, alternative names, catalytic activity, cofactors, and protein sequences linked to SWISS-PROT [60]. The six classes at the top of the hierarchy are oxidoreductases (EC 1.-.-.-), transferases (EC 2.-.-.-), hydrolases (EC 3.-.-.-), lyases (EC 4.-.-.-), isomerases (EC 5.-.-.-), and ligases (EC 6.-.-.-).
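The hierarchical reading of an EC number can be sketched in a few lines of Python. The example below splits a dotted EC number into its four levels, treats "-" as an unspecified level, and looks up the top-level class among the six classes listed above; the function names `parse_ec` and `same_group` are hypothetical helpers, not part of any existing tool.

```python
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def parse_ec(ec_number: str):
    """Split an EC number into four levels; '-' becomes None."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError("an EC number has exactly four levels")
    levels = tuple(None if p == "-" else int(p) for p in parts)
    if levels[0] not in EC_CLASSES:
        raise ValueError("unknown top-level class: " + parts[0])
    return levels, EC_CLASSES[levels[0]]


def same_group(ec_a: str, ec_b: str, depth: int = 4) -> bool:
    """True if two EC numbers agree on their first `depth` levels."""
    return parse_ec(ec_a)[0][:depth] == parse_ec(ec_b)[0][:depth]


levels, top_class = parse_ec("1.6.2.4")
print(levels, top_class)
print(same_group("3.2.1.17", "3.2.1.1", depth=3))
```

Comparing at a chosen depth is what makes the hierarchy useful for grouping: two enzymes can be placed in the same group at the third level while still differing at the fourth.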

2.4.3. Nucleic Acid Database

The Nucleic Acid Database [61], established in 1992, is a single archive for three-dimensional crystal structures of nucleic acids, including DNA (deoxyribonucleic acid) and RNA (ribonucleic acid). As of June 2007, the Nucleic Acid Database had collected 3557 nucleic acid structures derived from both the Protein Data Bank and the literature.

2.5. Structural Classification

2.5.1. SCOP

The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known protein structures. It is a largely manual classification of proteins according to their structural domains, based on similarities of their amino acid sequences and three-dimensional structures. The classification is hierarchical: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third level, fold, describes geometrical relationships. The leaf level is the protein domain, the basic unit of the hierarchy. Under each domain are the PDB entries of proteins, which refer to their own PDB descriptions. The levels of the SCOP hierarchy are described as follows:

1. Class - general structural architecture of the domain

2. Fold - similar arrangement of regular secondary structures but without evidence of evolutionary relatedness

3. Superfamily - sufficient structural and functional similarity to infer a divergent evolutionary relationship but not necessarily detectable sequence homology

4. Family - some sequence similarity can be detected.

Figure 5. The hierarchy of CATH.

2.5.2. CATH

The CATH Protein Structure Classification is a semi-automatic, hierarchical classification of protein domains published in 1997 by Christine Orengo, Janet Thornton and their colleagues. CATH shares many broad features with its principal rival, SCOP; however, there are also many areas in which the detailed classification differs greatly. The name CATH is an acronym of the four main levels in the classification. The four main levels of the CATH hierarchy are as follows:

1. Class - the overall secondary-structure content of the domain (automatic)

2. Architecture - a large-scale grouping of topologies which share particular structural features (orientation of secondary structures, manual)

3. Topology - high structural similarity but no evidence of homology. Equivalent to a fold in SCOP (topological connection and number of secondary structures)

4. Homologous superfamily - indicative of a demonstrable evolutionary relationship. Equivalent to the superfamily level of SCOP (superfamily clusters of similar structures and functions)

5. Sequence family

CATH defines four classes according to the ratio of secondary structure elements: mostly-alpha, mostly-beta, alpha and beta, and few secondary structures. The domains are automatically sorted into classes and clustered on the basis of sequence similarities. These groups form the H level of the classification. The topology level is formed by structural comparisons of the homologous groups. Finally, the architecture level is assigned manually. Figure 5 shows the CATH hierarchy at the class, architecture, and topology levels.

2.6. Functional Classification

2.6.1. Enzyme Classification

Functional hierarchical classification groups proteins into classes according to protein function and reaction. Functional classifications derive groups on the basis of functional similarity in terms of enzyme reaction mechanism, participation in biochemical pathways, functional roles, and cellular localization [62]. We chose functional hierarchical classification for three reasons: (1) to provide a function, proteins should have a stable structure in their functional area; (2) the correlation between a function-related structural region and protein function is easy to verify via the contact area of a protein-substrate complex; (3) if proteins have the same function, they should show conservation in their functional areas.

The Enzyme Commission (EC) number, developed by the International Union of Biochemistry and Molecular Biology (IUBMB), is used to classify enzymes based on the chemical reactions they catalyze. Proteins with the same EC number have the same function or biochemical reaction; therefore, they may have similar functional areas that react with other molecules to provide that function. The enzyme hierarchical classification uses four levels. The top level, the reaction type of the enzymes, is divided into six major classes: oxidoreductases (1.-.-.-), transferases (2.-.-.-), hydrolases (3.-.-.-), lyases (4.-.-.-), isomerases (5.-.-.-), and ligases (6.-.-.-), defined according to the reaction catalyzed. The second level is divided based on group-specific action, the third level by substrate specificity, and the fourth level contains the individual enzymes. Currently, Thornton et al. extend the Enzyme Data Bank [59] and the Protein Data Bank to build an enzyme structures database (http://www.ebi.ac.uk/thornton-srv/databases/enzymes/).

Six major classes of enzymes:

Class 1. oxidoreductases (1.-.-.-)
Class 2. transferases (2.-.-.-)
Class 3. hydrolases (3.-.-.-)
Class 4. lyases (4.-.-.-)
Class 5. isomerases (5.-.-.-)
Class 6. ligases (6.-.-.-)
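The four-level hierarchy can be made concrete with a small sketch. The code below is only an illustration of how an EC number might be split into its levels and compared at a chosen depth; the names EC_CLASSES, parse_ec, and same_family are hypothetical and do not belong to any enzyme database interface.

```python
# Illustrative sketch only: EC_CLASSES, parse_ec and same_family are
# hypothetical names, not part of any enzyme database API.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def parse_ec(ec_number):
    """Split an EC number such as '1.6.2.4' into its four levels.

    Levels left unspecified as '-' are returned as None.
    """
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError("an EC number has exactly four levels: %r" % ec_number)
    levels = [None if p == "-" else int(p) for p in parts]
    if levels[0] not in EC_CLASSES:
        raise ValueError("unknown top-level class in %r" % ec_number)
    return levels


def same_family(ec_a, ec_b, depth=4):
    """Return True if two EC numbers agree on their first `depth` levels."""
    return parse_ec(ec_a)[:depth] == parse_ec(ec_b)[:depth]


levels = parse_ec("3.2.1.17")
print(EC_CLASSES[levels[0]], levels)
```

Comparing at depth 4 corresponds to grouping chains by the full EC number, which is how the EC families in this proposal are formed; smaller depths give coarser groups.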

Enzyme classification also provides a good setting for studying protein structure and protein function. Proteins with the same EC number, which have the same function or catalyze the same reaction, are grouped together. Enzyme active sites commonly occur in large, deep cavities on the protein surface, and they need significant favorable interactions between ligand and protein, which usually means that small-molecule ligands are embedded in surface depressions. If proteins provide the same function, they should show a certain level of conservation in their structural conformation, and that conservation may be maintained either for the conformation itself or for the function. Because enzyme classification is one kind of functional classification, we use it to look for the relation between structure conservation and protein function.


3. THESIS STATEMENT

3.1. Motivation

In this dissertation, we study the relation between structure and function from the viewpoint of local structure. Based on the assumption that protein structure is conserved to preserve protein function, we try to discover conserved structural information from proteins of known function. The question is therefore how to mine local structures that are shared among a group of proteins and correlated with their function. A further issue is that sequence and structure similarity affect the quality of the mined local structures: if a group of proteins shares both high sequence similarity and high structure similarity, nearly every region is shared and the mining result is meaningless.

Currently, we focus on the following subtopics: (i) study of local structure representation; (ii) study of conserved structure for functional classification; (iii) mining general protein structural properties; (iv) coordination of sequence conservation and structural conservation; (v) involving new approaches of fast structure mining; and (vi) applying mining results in function/structure/sequence prediction and annotation.

3.2. Framework of this Dissertation

3.2.1. Study of Local Structure Representation

Different types of representation can be applied to describe local structure, such as protein blocks [47], structural alphabets [48, 63], structural motifs [64, 65], or sequence motifs with corresponding three-dimensional structure [65]. The original idea of protein blocks comes from the N-gram in information retrieval. Five consecutive Cα atoms are used as a block (a “protein block”) to describe protein local structure; therefore, a protein structure can be described as a series of protein blocks [47]. Moreover, an unsupervised cluster analyzer is used to identify a local structural alphabet composed of 16 folding patterns from the protein blocks. Yang et al. [66, 67] also apply a structural alphabet to describe local structure, and they obtain an alphabet of 23 letters representing 23 local structures. Jonassen et al. [65] use neighborhood sequences to discover sequence patterns and then check the patterns in their corresponding space. If a sequence pattern has k structure occurrences, it is regarded as a local packing motif. In this dissertation, we adopt the concept of the local packing motif proposed by Jonassen et al. as our local structure representation: a sphere within a distance of d Å from a central residue.
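The N-gram analogy behind protein blocks can be sketched as a sliding window over the Cα trace. The following is a minimal illustration that only produces the overlapping five-residue fragments; it does not reproduce the clustering that assigns each fragment to one of the 16 folding patterns, and the function name and toy coordinates are assumptions made for the example.

```python
def five_residue_windows(ca_coords, width=5):
    """Return all overlapping windows of `width` consecutive C-alpha positions.

    ca_coords is a list of (x, y, z) tuples ordered from N- to C-terminal.
    A chain shorter than `width` yields no window.
    """
    return [ca_coords[i:i + width] for i in range(len(ca_coords) - width + 1)]


# Toy C-alpha trace (hypothetical coordinates, 8 residues on a line).
toy_trace = [(3.8 * i, 0.0, 0.0) for i in range(8)]
windows = five_residue_windows(toy_trace)
print(len(windows))  # number of overlapping five-residue fragments
```

A chain of n residues gives n - 4 overlapping fragments, in the same way that a text of n words gives n - 4 overlapping 5-grams.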

3.2.2. Study of Conserved Structure for Functional Classification

Based on the common assumption that proteins of the same function share common local regions, the concept of local region conservation comes from the motif, which is a fragment with biological or functional meaning. In addition, we try to discover functional sites without the help of protein-ligand complex resources such as CSA (Catalytic Site Atlas) [39] and Protemot [44]. Our idea is to apply frequent itemset mining to a group of proteins that share the same function or reaction. If a protein structure can be decomposed into a set of local structures, frequent itemset mining can be applied directly to discover frequent local structures. The most important issue to address is how to link protein function to the discovered local structures. Because a discovered local structure is shared among a group of proteins, it can be viewed as a conserved structure for that group. Figure 6 shows the overall framework for mining conserved local structure.

[Figure 6 diagram: A Set of Protein Chains → (Ⅰ) Candidate Substructure Generation → (Ⅱ) Similar Substructure Grouping → (Ⅲ) Conserved Local Structure Determination → Representative set / Conserved structure]

Figure 6. The overall framework for mining conserved local structure.
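The idea of treating each protein chain as a transaction of local structures can be illustrated with a brute-force sketch. It assumes that every chain has already been decomposed into discrete local-structure labels; the labels s1 to s4, the chain names, and the support threshold are invented for the example, and the exhaustive enumeration stands in for a real frequent itemset algorithm.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support count} for itemsets found in at least
    `min_support` transactions (brute-force enumeration, small inputs only)."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Hypothetical local-structure labels per protein chain (toy data).
chains = {
    "chainA": ["s1", "s2", "s3"],
    "chainB": ["s1", "s3", "s4"],
    "chainC": ["s1", "s3"],
    "chainD": ["s2", "s4"],
}
result = frequent_itemsets(chains.values(), min_support=3)
print(result)
```

In this toy input, the labels s1 and s3 and the pair (s1, s3) are each supported by three of the four chains, so they would be the candidates for conserved local structure.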


3.2.3. Mining General Protein Structural Properties

Proteins fold through a series of interactions between amino acids. These interactions consist of atomic interactions and bond connectivity. In the sphere model of local structure representation, the environment surrounding a residue can be easily captured; therefore, a sphere model is an appropriate representation of the residue environment. The fast growth in the number of known protein structures provides more material for studying local residue environments with or without chemical bond information.

Residue environments have been studied and applied to protein threading and protein binding site characterization [5]. In a protein structure, the residue is the essential element of conformation, and residue-residue contacts affect the overall framework of the structure. Protein conformation is highly correlated with residue contacts formed through chemical interactions such as covalent bonds, ionic bonds, hydrogen bonds, van der Waals attractions, and disulfide bonds. Protein structural properties can be discovered within a single protein structure or in the interaction regions of protein complexes.

3.2.4. Involving the New Approaches of Fast Structure Mining

Because massive pair-wise sequence and structure comparison is time-consuming, performance must be improved for fast structure mining. Brevern et al. use protein blocks [6] to understand the sequence-structure relationship, and the structural alphabet [7] is an improved representation of protein blocks. With these representations, a protein structure is encoded as a one-dimensional sequence that can be treated like a protein sequence, so BLAST can be applied directly. A substitution matrix for the structural alphabet is a further issue that has to be addressed. Currently, our proposed approach applies signature and indexing techniques for fast structure mining. As in conserved structure mining, we use the neighborhood residues sphere to describe protein local structure, transform each sphere into a bit-string signature that encodes its residue environment, and apply an indexing technique to provide fast database search.
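The signature and indexing idea can be illustrated as follows. The exact signature design is not specified above, so this sketch makes a simple assumption: each sphere is reduced to a 20-bit string with one bit per amino acid type present in the sphere, and spheres are indexed by that value. All names and the toy spheres are hypothetical.

```python
# Assumed signature design: one bit per amino acid type present in the sphere.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}


def signature(residues):
    """Encode the residue composition of a sphere as a 20-bit integer."""
    sig = 0
    for aa in residues:
        sig |= BIT[aa]
    return sig


def build_index(spheres):
    """Map each signature to the identifiers of the spheres that produce it."""
    index = {}
    for sphere_id, residues in spheres.items():
        index.setdefault(signature(residues), []).append(sphere_id)
    return index


def candidates(index, query_residues):
    """Return spheres whose signature contains every bit set in the query."""
    q = signature(query_residues)
    hits = []
    for sig, ids in index.items():
        if (sig & q) == q:
            hits.extend(ids)
    return sorted(hits)


# Toy spheres given as one-letter residue strings (hypothetical data).
spheres = {"n1": "HDSGG", "n2": "HDS", "n3": "CCGA"}
index = build_index(spheres)
print(candidates(index, "HS"))
```

Under this assumed design the bit test is a filter only: it discards spheres that cannot contain the query residues, and the surviving candidates would still need a structural comparison.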


3.2.5. Coordination of Sequence Conservation and Structural Conservation

MAGIIC-PRO [8], developed by Hsu et al., detects functional signatures through analysis of homologous protein sequences: it applies sequence pattern mining to discover the functional signatures of a query protein. The authors link sequence patterns to protein function via the spatial information corresponding to the sequence patterns. From this point of view, they use sequence conservation mining to discover functional motifs related to functional sites. We take a different viewpoint that starts from locally conserved structures: in the future, we attempt to discover structure conservation with integrated sequence information in order to analyze the relationships among sequences, structures, and functions. Functional classification is a good basis for discovering structure-function relations because protein-ligand complex information is available.

Within an enzyme family, proteins derived from different species have the same function; therefore, an enzyme family is a good starting point for discovering sequence and structure conservation based on the relationships among sequences, structures, and functions.

3.2.6. Apply Mining Results in Function/Structure/Sequence Prediction and Annotation

Computer-aided annotation of protein sequences, structures, and functions has been studied on the basis of global sequence and structure information. Recent research has applied structural properties to prediction from primary sequence in order to improve prediction accuracy. Our idea instead starts from local protein sequence and structure to infer function; we therefore include the structural properties of local regions to study the correlation of sequence, structure, and function from the view of the local region. To annotate protein function, an alternative is to use the mining results to predict it, since a mining result discovered from a group of proteins with the same function should be significant for that function.


4. RESEARCH DESCRIPTION

4.1. Protein Local Structure Representation

4.1.1. Introduction

Protein function is carried out in specific regions of the protein structure, especially in local structures; therefore, local structure comparison plays an important role in detecting local structure similarity. Proteins with the same function should share similar local structures that provide a binding area for contacting small molecules, and these local structures are the functional areas. In the past, molecular biologists examined many functional protein structures to understand the relationships among functionality, amino acid sequence, and protein structure [42, 68, 69]. These studies not only help molecular biologists understand functional proteins in more detail but also provide helpful information when unfamiliar proteins are encountered. With the help of fast computers and sophisticated algorithms, researchers can mine more useful sequence and structure information from manually curated protein databases and apply the mined knowledge to protein function prediction, active site prediction, and other structure-based research.

With the fast growth of the Protein Data Bank (PDB) [3, 56], protein functional analysis has become more important, and protein structure comparison over large amounts of structure data is widely applied in protein structure analysis. Previous studies and observations indicate that protein function is highly correlated with three-dimensional (3D) structure, and research has focused especially on structural fragments that may be connected to protein function or that support the overall framework [70-72]. Local structure similarity [41] can reveal similar local structures that may be highly related to protein function.

Currently, there are two major directions for analyzing protein function: sequence-level analysis and structure-level analysis. Mining conserved areas related to possible binding areas is an active topic in inferring protein function from either kind of analysis. In sequence-level analysis, sequence alignment can be applied to detect conservation among protein sequences, although the conserved area it identifies is only approximate [70]. Sequence-conserved regions are then mapped into their corresponding 3D space to link sequence, structure, and function [73]. The question now is whether we can discover local structure conservation related to functional areas, and how. In structure-level analysis, the binding area of a protein-ligand complex [39, 44] is widely used to identify protein function via local structure comparison. Scientists first find protein pockets and voids [71, 72], which are possible binding regions. These regions can be further investigated by ligand docking to show that the discovered local structure conservations are conserved for protein function.

Homologous proteins may have different functions, and homologues are hard to detect by sequence-based identification when evolution has preserved the folding pattern but left little sequence identity. In such cases, structure-based identification of homologues can succeed, because structure is conserved to maintain protein functionality [74].

4.1.1.1. Motivation

In this study, our motivation is to discover local structure conservation via protein structure analysis. We therefore discuss local structure representations for structure conservation discovery and the related mining approaches and algorithms.

Based on the widely held assumption that proteins of the same function share common local structure, we developed an approach that mines conserved regions from a classified enzyme dataset [75]. We try to detect similar local structures using different approaches and local structure representations, in order to mine local structure conservation and to find the link between local structure and functional region. Beyond that, we discuss the discovery of local structure conservation and the relationships between local structures and functional regions.

4.1.2. Local Conservation and Functional Site

As found by Campbell and Jackson [53], the Src homology 2 (SH2) family can be divided into two groups on the basis of the similarity of binding site residues. Their work showed that proteins of the same family share similar local sequences and local structures close to the binding area. It also showed that conserved positions are scattered along the whole sequence but compact in 3D space. In this case, conservation exists in local sequence and in the corresponding 3D structure, and the local structure is related to the binding area. Moreover, MAGIIC-PRO, developed by Hsu et al. [49] for detecting functional signatures, applies sequence pattern mining to discover the functional signatures of a query protein. Their experimental results showed that gapped local sequence patterns can be detected whose corresponding local structures may be close to the protein functional site.

Protein function often takes place in cavities, pockets, or voids of proteins. Therefore, the study of protein local structures is helpful for understanding protein function, and discovering the relationship between function and protein local structure is an active research trend.

In previous studies, CSA [39] extracts functional site information manually from the research literature, whereas Protemot [44] uses a computational approach to detect and extract all protein-ligand complexes in the PDB automatically. Another trend on this topic is to discover possible functional areas on the protein surface, as in CASTp [72] and pvSOAR [71].

4.1.3. Local Structure Representation

In the task of mining local structure conservation, local structure representation is the first consideration. In this study, we first use a straightforward representation derived from the results of protein structure comparison. In addition, we adopt and modify the idea of the structural motif in SPratt2 [64], which uses a sphere to describe local structure for discovering structural motifs. We give details in the following sub-sections.

4.1.3.1. Alignment Result of Protein Structure Comparison

The alignment result generated by protein structure comparison is the first candidate for mining local structure conservation. When a set of protein structures is compared pair-wise, we obtain a set of matched Cα points from each compared pair. We then apply a simple clustering algorithm to group the matched Cα points into local structures. Each group serves as a representation of local structure for further investigation.
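The grouping step can be sketched with a simple distance-based clustering. The clustering algorithm is not named above, so the single-linkage rule, the cutoff value, and the toy coordinates below are assumptions chosen only to make the idea concrete.

```python
import math


def cluster_points(points, cutoff):
    """Single-linkage grouping of 3D points.

    Two points fall into the same cluster when they are connected by a chain
    of points whose consecutive distances are at most `cutoff`.
    Returns clusters as sorted lists of point indices.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= cutoff]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Toy matched C-alpha points (hypothetical coordinates, in angstroms).
matched = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (30, 0, 0), (33, 0, 0)]
groups = cluster_points(matched, cutoff=4.0)
print(groups)
```

Each resulting group of indices plays the role of one candidate local structure; a different linkage rule or cutoff would change how fragmented the groups are.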

4.1.3.2. Neighborhood Residues Sphere

To depict local structure with an appropriate representation, our original idea comes from the neighbor string NSr developed by Jonassen et al. [65], which is used to mine structural motifs. This string encodes all residues in the structure that are within a distance of d Å from residue r (d = 10 by default), including r itself, ordered from N-terminal to C-terminal. We redefine NSr as the NRS, the neighborhood residues sphere, which additionally includes structure coordinate information; the NRS therefore contains local structure information together with its sequence. As shown in Figure 7, if the central residue is colored red and the radius is 10 Å, the residues in the blue part are the neighbors within 10 Å of the central residue.

Figure 7. Neighborhood residues sphere.

A real case of protein (PDBID: 1AU0). Residues in blue lie within 10 Å of the central residue in red.
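The construction of a neighborhood residues sphere can be sketched as follows. The sketch assumes that each residue is represented by a single point (for example its Cα atom), which is a simplification introduced here; the function names and the toy chain are likewise hypothetical.

```python
import math


def neighborhood_residues_sphere(residues, center_index, d=10.0):
    """Collect residues within d angstroms of the central residue.

    residues is a list of (one-letter code, (x, y, z)) ordered from N- to
    C-terminal, with one representative point per residue (an assumption).
    The central residue itself is included, and chain order is preserved.
    """
    center = residues[center_index][1]
    return [(i, aa, xyz) for i, (aa, xyz) in enumerate(residues)
            if math.dist(center, xyz) <= d]


def neighbor_string(nrs):
    """Read the residue letters of a sphere in N- to C-terminal order."""
    return "".join(aa for _, aa, _ in nrs)


# Toy chain: eight residues placed 3.8 angstroms apart on a line.
toy_chain = [(aa, (3.8 * i, 0.0, 0.0)) for i, aa in enumerate("MKVLAGDE")]
nrs = neighborhood_residues_sphere(toy_chain, center_index=3)
print(neighbor_string(nrs))
```

Keeping the coordinates next to the residue letters is what distinguishes the NRS from the plain neighbor string: the letters support sequence comparison, and the coordinates support structure comparison.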

4.1.4. Structure Conservation Detection

Our goal is to detect locally conserved protein structure that is related to protein function or close to the protein binding area. In previous research, the common assumption is that proteins with the same function share similar local structure. Hence, mining local structure regions that have biochemical meaning is very useful for identifying protein function. Given a set of protein chains that have the same function, we extract local structure patterns shared among those chains and apply the concept of frequent itemset mining to discover structure conservation [76]. In this section, we introduce two methods of mining local structure patterns: one uses pair-wise protein structure comparison and the other is a sphere-based conservation mining approach. They are described in the following sub-sections.


4.1.4.1. Pair-wise Protein Structure Comparison Approach

In this approach, we use pair-wise protein structure comparison to obtain matched residues, group them into substructures, and then check substructure similarity. Our strategy is to represent local structure by the residues matched in protein structure comparison and then to detect frequent substructures. We use EMPSC [77] as the protein structure alignment tool to compare protein structures pair-wise. As shown in Figure 8, the overall framework contains three major parts: (I) local structure generation via pair-wise local structure comparison, (II) substructure comparison and similarity measurement, and (III) similar substructure grouping and representative pattern selection.

[Figure 8 diagram: A Set of Protein Chains → (Ⅰ) Local Structure Generation via Pair-wise Local Structure Comparison → (Ⅱ) Substructure Comparison & Similarity Measurement → (Ⅲ) Similar Substructure Grouping & Representative Pattern Selection]

Figure 8. The flow chart for mining conserved structural patterns via pair-wise protein structure comparison.

4.1.4.2. NRS-based Conservation Mining Approach

In text mining, frequent itemset mining is often applied to find the frequent terms in a corpus. Given a set of protein chains (e.g. 4HHB:A), can we apply the concept of frequent itemset mining to protein chains? Figure 9 illustrates the overall framework for pattern extraction. Given a set of protein chains, our goal is to extract representatives for the set. Those representatives are considered conserved patterns, that is, substructures shared by most of the proteins. Because the NRS contains both sequence and structure information, we can apply analysis methods to both kinds of data. Our strategy is to apply sequence alignment for sequence conservation and then structure alignment for structure conservation. The framework is divided into three major steps that select conserved patterns for a set of protein chains: (I) NRS segmentation, (II) sequence conservation grouping, and (III) representative selection.

[Figure 9 diagram: A Set of Protein Chains → (Ⅰ) NRS Segmentation → (Ⅱ) (a) Pair-wise Sequence Alignment, (b) Sequence Clustering → (Ⅲ) (a) Structure Alignment, (b) Representation Selection → Conserved Pattern Output]

Figure 9. The flow chart for mining conserved structural patterns via NRS-based conservation mining approach.

4.1.5. Experiments

To compare the two approaches for detecting structure conservation, we use enzyme classification as our data collection and apply both methods to identify structure conservation in local regions and to find the relationship between local structure regions and substrates or ligands. From PDBSProtEC [13], we randomly select 6 EC families as our dataset to evaluate the two methods. Table 2 lists all protein chains of these 6 EC families after removal of identical protein sequences. Substrate information is taken from PDBSum [12] (http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/).


Table 2. List of protein chains for 6 randomly selected EC families.

EC Number    # of Chains    List of Protein Chains
1.6.2.4      18             1AMO:A 1B1C:A 1BVY:F 1FAG:A 1FAH:A 1J9Z:A 1JA1:A 1JME:A 1JPZ:A 1P0V:A 1P0W:A 1P0X:A 1SMI:A 1YQP:A 1ZO4:A 1ZOA:A 2BF4:A 2BPO:A
1.14.99.3    14             1DVE:A 1DVG:A 1IW0:A 1N3U:A 1OYK:A 1WE1:A 1WNV:A 1WNW:A 1WNX:A 1WOV:A 1XJZ:A 1XK0:A 1XK1:A 1XK2:A
2.3.1.74     12             1BI5:A 1CGK:A 1CHW:A 1CML:A 1D6H:A 1D6I:A 1I86:A 1I88:A 1I89:A 1I8B:A 1JWX:A 1U0V:A
4.1.2.17     14             1DZU:P 1DZV:P 1DZW:P 1DZX:P 1DZY:P 1DZZ:P 1E46:P 1E47:P 1E48:P 1E49:P 1E4A:P 1E4B:P 1E4C:P 1FUA:_
5.3.1.9      13             1B0Z:A 1G98:A 1GZD:A 1IRI:A 1J3P:A 1JLH:A 1N8T:A 1T10:A 1TZB:A 1U0E:A 1X7N:A 1X82:A 1ZZG:A
6.3.2.17     7              1FGS:_ 1JBV:A 1W78:A 2GC5:A 2GC6:A 2GCA:A 2GCB:A

Table 3. Experimental results for local conservation discovery via pair-wise protein structure comparison.

EC Number    # of local conservation    # of ligand contact
             PSC-based    NRS           PSC-based    NRS
1.6.2.4      13           16            3            4
1.14.99.3    5            0             3            0
2.3.1.74     16           0             4            0
4.1.2.17     0            49            0            4
5.3.1.9      7            3             0            0
6.3.2.17     4            6             0            3

4.1.6. Discussions

4.1.6.1. Pair-wise Protein Structure Comparison Approach

Table 3 lists the number of local conservations we found and the number of substrate contacts, counted when a substrate lies within 10 Å of a discovered local conservation. Local conservation is not discovered in every EC family, because the global structures within a family may be either too similar or too diverse. With this approach we detect no local conservation in EC 4.1.2.17, and we find that the sequences within this EC family share above 90% sequence identity, as checked by BLASTCLUST [36]. Above 90% sequence identity means that the chains have essentially the same global structure, so it is hard to detect local conservation with this approach. We list the number of substrate, ligand, or metal ion contacts in order to relate local conservations to substrates.
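The contact count reported in Table 3 can be illustrated with a short sketch. It assumes that a local conservation counts as a ligand contact when any of its residue points lies within 10 Å of any ligand atom; this reading of the criterion, the function names, and the toy coordinates are assumptions made for the example.

```python
import math


def contacts_ligand(region_coords, ligand_coords, cutoff=10.0):
    """Return True if any region point is within `cutoff` of any ligand atom."""
    return any(math.dist(r, l) <= cutoff
               for r in region_coords for l in ligand_coords)


def count_ligand_contacts(regions, ligand_coords, cutoff=10.0):
    """Count the conserved regions that contact the ligand."""
    return sum(1 for coords in regions.values()
               if contacts_ligand(coords, ligand_coords, cutoff))


# Toy data (hypothetical coordinates, in angstroms).
regions = {
    "pattern1": [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    "pattern2": [(40.0, 0.0, 0.0)],
}
ligand = [(8.0, 0.0, 0.0), (9.0, 1.0, 0.0)]
n_contacts = count_ligand_contacts(regions, ligand)
print(n_contacts)
```

Here only the first pattern lies within the cutoff of the ligand, so the contact count is one of two discovered patterns.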

Although we test only a few cases of discovering conserved structure patterns in proteins with the same function, the results reveal that regions of local structure conservation can be detected under functional classification. We select all available substrate information related to the protein chains. Figure 10 shows the relationship between conserved patterns and substrates for protein chain 1J9Z:A: the substructures are the areas colored yellow, aqua, or lime, and the balls colored red, blue, and navy are the substrates (navy: FAD, red: NAP, blue: FMN). Moreover, we find that the local conservations discovered in protein chains 1BVY:A, 1AMO:A, 1BU7:A, 1SMI:A, and 1B1C:A have substrate/ligand contacts with molecules such as FMN, EDO, FAD, HEM, and NAP.

Figure 10. Protein PDB ID 1J9Z:A and its binding substrates.

The areas in red, blue, and navy are the substrates NAP, FMN, and FAD, respectively; the discovered local conservations are shown in yellow, lime, and aqua.

4.1.6.2. NRS-based Conservation Mining Approach

For each EC family, we apply the NRS-based conservation mining approach to mine local conservation. Because of the large number of spheres, we first apply sequence alignment to group similar sequences and then check structure similarity within each group via geometric hashing. Table 3 also lists the number of local conservations and the number of substrate, ligand, or metal ion contacts for this approach. For two EC families, EC 1.14.99.3 and EC 2.3.1.74, local structure conservation could not be detected. In EC 2.3.1.74, the sequences share above 90% sequence identity. In EC 1.14.99.3, there are still only 3 protein chains when the sequence identity cut-off is set below 50%.

As shown in our experimental results, conserved patterns are mined from protein chains with the same EC label that share high conservation in local structure, and the conserved patterns have a high capacity to identify those chains. We also find that protein chains with the same EC label can be grouped into two or more sub-groups. For example, when this approach is applied to all EC families, EC 3.2.1.17 has 895 protein chains in total, from which we mine two conserved patterns: 326 protein chains share one of them and 417 protein chains share the other, and the two conserved patterns have no overlapping region. According to our observation, the number of conserved patterns is related to the number of protein chains. In general, the more protein chains an EC label contains, the fewer conserved patterns are found, if the protein chains within the EC label are diverse.

4.1.6.3. Summarization

Figure 11 shows PDBID 1SMI:A, whose substrate is HEM (PROTOPORPHYRIN IX CONTAINING FE). The area colored blue is the local conservation discovered by the NRS-based conservation mining approach, with its central residue colored red, and the areas colored yellow are two local conservations discovered by the pair-wise protein structure comparison approach. The area in pink is the overlap between the regions discovered by the two approaches. Comparing the two approaches, the local conservation detected by the pair-wise protein structure comparison approach is more fragmented than that detected by the NRS-based conservation mining approach. The reason is that the NRS is better suited to describing residue environment information, whereas a group of matched residue points merely provides a locally similar area and is not a well-organized structure representation.
