CHAPTER 1 INTRODUCTION
1.2 THESIS OVERVIEW
First, in Chapter 2, we developed a novel structural alphabet derived from the kappa-alpha plot and a novel BLOSUM-like substitution matrix, called the structural alphabet substitution matrix (SASM). This structural alphabet was valuable for reconstructing protein structures from just a small number of structural fragments and for developing a fast structure database search method. In addition, the SASM was designed to favor the alignment of structural segments between homologous structures that share low sequence identity.
The aligned score from the SASM matrix provides structural similarity estimates and information on evolutionary distance.
In Chapter 3, we described the theory and results of 3D-BLAST, which is based on the structural alphabet and the SASM. 3D-BLAST was used to rapidly search a protein structure database for all known homologs of a query (new) structure and to return a ranked list of alignments. The results showed that our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix to find the longest common substructures with high-scoring structured segment pairs from an SADB database.
In Chapter 4, the structural alphabet and the SASM were also applied to rapidly identify the structural domains and determine the evolutionary superfamilies of a query protein structure.
The web server we built was named fastSCOP. fastSCOP integrates 3D-BLAST (a fast structural database search tool) and MAMMOTH (a fast detailed structural alignment tool); the former is required for efficiency and the latter for accuracy.
Chapter 5 presented our current studies on the Space-Related Pharmamotif (SRP) in the interacting sites of proteins. The SRP is defined as a set of space-related structural motifs in which a set of similar protein sub-site structures consistently interacts with a ligand, DNA, or peptide. We demonstrated preliminary results of SRP discovery and motif search. These results mainly illustrated the feasibility of studying SRP. Finally, Chapter 6 described some conclusions and future perspectives.
Chapter 2
Kappa-alpha Plot Derived Structural Alphabet and Structural Alphabet Substitution Matrix
2.1 Introduction
A major challenge facing structural biology research in the post-genomics era is to discover the biologic functions of genes identified by large-scale sequencing efforts. As protein structures increasingly become available and structural genomics research provides structural models in genome-wide strategies [1], proteins with unassigned functions are accumulating, and the number of protein structures in the Protein Data Bank (PDB) is rapidly rising [4]. The current structure-function gap highlights the need for powerful bioinformatics methods with which to elucidate the structural homology or family of a query protein by known protein sequences and structures.
The three-state secondary elements, namely α-helix, β-sheet, and coils, are rather crude for predicting protein structure, and it is not possible to make use of these elements in three-dimensional (3D) reconstruction without additional information. Many approaches have been proposed to replace three-state secondary structure descriptions with various local structural fragments, also known as a 'structural alphabet' [17-23], which can redefine not only regular periodic structures but also their capping areas. Such studies have described local protein structures according to various geometric descriptors (for example, Cα coordinates, Cα
distances, α or φ, and ψ dihedral angles) and algorithms (for example, hierarchical clustering, empirical functions, and hidden Markov models [HMMs] [18]). Many of these methods involve protein structure prediction; an exception is the SA-Search tool [15], which is based on Cα coordinates and Cα distances, and which adopts a structural alphabet and a suffix tree approach for rapid protein structure searching.
To address the above issues, we have developed a novel kappa-alpha (κ, α) plot derived structural alphabet and a novel BLOSUM-like substitution matrix, called SASM (structural
alphabet substitution matrix), for BLAST [6], which searches in a structural alphabet database (SADB). This structural alphabet is valuable for reconstructing protein structures from just a small number of structural fragments and for developing a fast structure database search method called 3D-BLAST. This tool is as fast as BLAST and provides the statistical significance (E-value) of an alignment, indicating the reliability of a hit protein structure. For the purposes of scanning a large protein structure database, 3D-BLAST is fast and accurate and is useful for the initial scan for similar protein structures, which can be refined by detailed structure comparison methods (for example, CE [9] and MAMMOTH [10]).
2.2 (κ, α)-map cluster and structural alphabet
For coding the structural alphabet and calculating the substitution matrix, a pair database of structurally similar protein pairs with low sequence identity was obtained from SCOP 1.65 [43]. Of 2051 families in four major classes (all α, all β, α+β, and α/β) with <40% sequence homology to each other, we excluded a number of problem entries, including poor-quality structures, entries with residue numbering problems, and small-sized families (i.e., number of domains <2). We selected 674 structural pairs (i.e., 1348 proteins) based on the following criteria: (1) one pair was selected for each family, and one extra pair was selected for a family having >15 domains; (2) pairs must have <40% sequence identity; (3) pairs must have rmsd
<3.5 Å, with >70% of aligned residues included in the rmsd calculation. In total, these protein pairs had an average sequence identity of 26% (462 pairs below 30% identity), an average rmsd of 2.3 Å, and average aligned residues of 90% (207,492 aligned residues out of 230,915 residues). The amino acid composition of these 1348 proteins was similar to that of proteins in the Swiss-Prot database.
2.2.1 (κ, α)-Map
A structure fragment (five residues long) was defined by the (κ, α)-pair angles as shown in Figure 2.1. The κ angle, ranging from 0° to 180°, of a residue i is defined as a bond angle formed by three Cα atoms of residues i – 2, i, and i + 2. The α angle, ranging from –180° to 180°, of a residue i is a dihedral angle formed by the four Cα atoms of residues i – 1, i, i + 1, and i + 2. A specific series of structural fragments, called the (κ, α) map, represents a protein structure. Therefore, each protein structure may form a specific (κ, α)-map distribution as shown in Figure 2.2.
Figure 2.1 Definition of the kappa (κ) and alpha (α) angles.
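The two angles defined above can be computed directly from Cα coordinates. The following is a minimal Python sketch, not the implementation used in this work. It assumes the DSSP-style convention in which κ is the angle between the direction vectors Cα(i) − Cα(i − 2) and Cα(i + 2) − Cα(i), so that an extended chain gives κ near 0°, and it assumes the usual signed dihedral convention for α. The function names `kappa`, `alpha`, and `kappa_alpha_map` are illustrative only.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def kappa(ca, i):
    """Bend angle in degrees (0..180) from the Ca atoms of residues i-2, i, i+2.

    Assumption: DSSP-style bend angle between the two chain direction vectors.
    """
    v1 = _sub(ca[i], ca[i - 2])
    v2 = _sub(ca[i + 2], ca[i])
    cos_k = _dot(v1, v2) / math.sqrt(_dot(v1, v1) * _dot(v2, v2))
    cos_k = max(-1.0, min(1.0, cos_k))  # guard against rounding error
    return math.degrees(math.acos(cos_k))

def alpha(ca, i):
    """Signed dihedral in degrees (-180..180) of Ca atoms i-1, i, i+1, i+2."""
    b1 = _sub(ca[i], ca[i - 1])
    b2 = _sub(ca[i + 1], ca[i])
    b3 = _sub(ca[i + 2], ca[i + 1])
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))

def kappa_alpha_map(ca):
    """Return one (kappa, alpha) pair per five-residue window: L - 4 pairs."""
    return [(kappa(ca, i), alpha(ca, i)) for i in range(2, len(ca) - 2)]
```

Because each pair needs two residues on either side of residue i, a chain of L residues yields L − 4 (κ, α) pairs.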
To code the structural alphabet and calculate the substitution matrix, we selected 674 structural pairs (1,348 proteins), which are structurally similar and have low sequence identity, from SCOP based on two criteria: pairs must have rmsd under 3.5 Å, with more than 70% of aligned residues included in the rmsd calculation; and pairs must have under 40% sequence identity. The accumulated (κ, α)-map matrix (Figure 2.3) consists of 225,523 protein fragments derived from 1348 proteins. When the angles of (κ, α) are divided by 10°, this matrix has 648 cells (36 × 18). The fragment frequency of each cell in this matrix is unbalanced because the protein structures are significantly conserved with regard to α-helix (82,843 segments) and β-strand structures (52,371 segments). Of these helix segments, 71.1% (58,897 segments) are located in four cells that contain 22,310, 15,736, 13,013, and 7,838 segments.
Figure 2.2 The (κ, α) distribution map of 1brbI (square) and 1bf0 (circle).
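The 10° binning of the (κ, α) map described above can be illustrated with a short Python sketch. It assumes half-open bins (for example, 100° ≤ κ < 110°) and places the boundary values κ = 180° and α = 180° in the last bin; the names `cell_index` and `accumulate` are hypothetical.

```python
from collections import Counter

def cell_index(kappa_deg, alpha_deg, step=10):
    """Map a (kappa, alpha) pair to a (kappa_bin, alpha_bin) cell.

    kappa spans 0..180 (18 bins at 10 degrees) and alpha spans -180..180
    (36 bins at 10 degrees), giving 18 * 36 = 648 cells.
    """
    n_kappa = 180 // step
    n_alpha = 360 // step
    k = min(int(kappa_deg // step), n_kappa - 1)
    a = min(int((alpha_deg + 180) // step), n_alpha - 1)
    return k, a

def accumulate(pairs, step=10):
    """Count how many (kappa, alpha) pairs fall into each cell."""
    return Counter(cell_index(k, a, step) for k, a in pairs)
```

Applying `accumulate` to the (κ, α) pairs of every protein in a data set would give an accumulated map of the kind shown in Figure 2.3.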
In this study, the structural distance of a pair of 5-mer protein segments i and j is determined from the rmsd value of the five Cα atom positions, and is given as follows:

$$d_{ij} = \sqrt{\frac{1}{5}\sum_{k=1}^{5}\left[(X_k - x_k)^2 + (Y_k - y_k)^2 + (Z_k - z_k)^2\right]} \tag{1}$$

where $(X_k, Y_k, Z_k)$ and $(x_k, y_k, z_k)$ are the coordinates of the kth Cα atom of segments i and j, respectively. The structural distance is also used to define the intra-segment and inter-segment distances.

Figure 2.3 The distribution of accumulated (κ, α) plot of 225,523 segments derived from the pair database with 1,348 proteins. Axes: alpha and kappa; each cell is labeled with its structural letter.
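As a small illustration of the structural distance between two 5-mer segments, the Python sketch below computes the rmsd over five Cα positions. It assumes the two segments have already been superimposed, since it compares coordinates directly; the function name `segment_distance` is hypothetical.

```python
import math

def segment_distance(seg_i, seg_j):
    """rmsd over the five Ca positions of two 5-mer segments.

    Each segment is a sequence of five (x, y, z) tuples.
    Assumption: the segments are already optimally superimposed.
    """
    if len(seg_i) != 5 or len(seg_j) != 5:
        raise ValueError("each segment must contain exactly five Ca atoms")
    total = 0.0
    for (X, Y, Z), (x, y, z) in zip(seg_i, seg_j):
        total += (X - x) ** 2 + (Y - y) ** 2 + (Z - z) ** 2
    return math.sqrt(total / 5)
```

Two identical segments give a distance of 0, and a rigid shift of one segment by a vector of length 5 gives a distance of 5.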
2.2.2 Structural Alphabet
We aimed to use the structural alphabet to represent pattern profiles of the backbone fragments by clustering the accumulated (κ, α)-map matrix (Figure 2.3). A nearest-neighbor clustering (NNC) algorithm was developed to cluster 225,523 fragments in the accumulated (κ, α)-map matrix (Figure 2.3) into 23 groups using the following steps and goals: (1) identifying a representative structural segment for each cell in this matrix; (2) clustering 648 representative segments into 23 groups by grouping similar representative segments and restricting the maximum number of segments in a cluster; (3) in each cluster, identifying a representative segment based on the cell weight, which is defined as $w_i = (1/S_i) \big/ \sum_{j=1}^{M} (1/S_j)$, where $S_i$ is the number of segments in cell i and M is the number of cells in this cluster; (4) assigning the representative segment of a cluster to a structural letter (Figure 2.4); (5) obtaining a composition of 23 structural letters that is similar to the 20 common amino acids.
We developed an NNC algorithm instead of using a standard clustering algorithm, such as hierarchical clustering or K-means, because these standard methods cannot satisfy goals (2), (3), and (5).
Figure 2.4 The representative 3D fragments of 23 structural alphabets, grouped as helix, helix-like, strand, strand-like, and others.
3D-BLAST used BLAST as the search method and was designed to maintain the advantages of BLAST. However, 3D-BLAST is slow if the structural alphabet is un-normalized, because the BLAST algorithm searches for a statistically significant alignment in two main steps [7]. It first scans the database for words that score more than a threshold value when aligned with words in the query sequence; it then extends each such 'hit' word in both directions to check the alignment score. To reduce the ill effects of using an un-normalized structural alphabet, we set a maximum number (γ) of segments in a cluster so that the 23 structural letters have a composition similar to that of the 20 amino acids. The value of γ was set to 16,000 (about 7.0% of total structural segments in the pair database).
According to the restriction parameter γ, the cell with the highest number of segments (22,310) in the accumulated (κ, α)-map matrix was divided into two subcells by equally separating the κ and α angles: one is located in 100° ≤ κ < 105° and 40° ≤ α < 45°, and the other is in 105° ≤ κ < 110° and 45° ≤ α < 50°. These two subcells were labeled as structural letters A and Y, respectively. The NNC method was then applied to cluster the remaining 203,213 fragments into 21 groups. A representative segment of each cell in the accumulated (κ, α)-map matrix was first determined. For each cell, a segment distance matrix (d) of size N × N, where N is the total number of segments in the cell, was created by computing the rmsd values of all-against-all segments. An entry ($d_{ij}$), which represents the structural distance of segments i and j, is computed by the rmsd of five Cα atom positions and is given as

$$d_{ij} = \sqrt{\frac{1}{5}\sum_{k=1}^{5}\left[(X_k - x_k)^2 + (Y_k - y_k)^2 + (Z_k - z_k)^2\right]}$$

where $(X_k, Y_k, Z_k)$ and $(x_k, y_k, z_k)$ are the coordinates of the kth Cα atom of segments i and j, respectively. For each segment i, the sum of distance ($d_i$) between segment i and the other segments in this cell is $d_i = \sum_{m=1}^{N} d_{im}$. The segment with the minimum sum of distance is selected as the representative segment of a cell. After the representative segment of each cell is identified, a distance matrix (D) is stored with the rmsd values by computing all-against-all representative segments for these 647 segments. Each entry ($D_{ij}$, 1 ≤ i, j ≤ 647) is a measure of structural similarity, as defined in Equation 1, between representative segments i and j. In order to ensure that the 3D conformations of the segments clustered in the same group are similar, an rmsd threshold (ε) of the structural similarity is set to 0.5.
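Selecting the segment with the minimum sum of distances amounts to choosing the medoid of a cell. The following Python sketch illustrates the idea for a precomputed distance matrix; the function name `representative` is hypothetical, and ties are broken by taking the lowest index, which is an assumption of this sketch.

```python
def representative(dist):
    """Return the index of the segment with the minimum sum of distances.

    `dist` is a square, symmetric list-of-lists with dist[i][m] holding the
    structural distance between segments i and m of one cell.
    """
    sums = [sum(row) for row in dist]
    return min(range(len(dist)), key=sums.__getitem__)
```

For a cell with N segments the matrix has N × N entries, so for very populated cells the all-against-all computation dominates the cost of this step.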
Based on the distance matrix D and restriction parameters (ε and γ), the NNC method works as follows:
(1) Create a new cluster (Ci, 1≤i≤20) by first selecting an unlabeled cell (a) with the maximum number of segments. Label this cell as Ci.
(2) Add an unlabeled cell, which is the nearest neighbor (i.e., a minimum rmsd value in row a of matrix D) of the cell a, into this cluster if this rmsd value is less than ε and the total number of segments in this cluster remains less than γ. Label this cell as Ci. Repeat this step until an added cell violates the restriction thresholds, ε or γ.
(3) Repeat steps 1 and 2 until the number of clusters equals 21 or all of the cells are labeled.
(4) Assign all of the remaining unlabeled cells to a cluster C22.
Here, ε = 0.95 Å and γ = 16,000.
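The steps above can be sketched in Python as follows. This is an illustrative reading of the procedure rather than the original implementation: it assumes that the γ restriction applies to the running total of segments in the cluster, that neighbors are always measured from the seed cell a, and that leftover cells receive one final catch-all label. The names `nnc`, `sizes`, and `max_clusters` are hypothetical.

```python
def nnc(sizes, D, eps, gamma, max_clusters):
    """Nearest-neighbor clustering of cells (a sketch under stated assumptions).

    sizes[c]  number of segments in cell c
    D[a][c]   rmsd between the representative segments of cells a and c
    Returns a list with one cluster label per cell.
    """
    n = len(sizes)
    labels = [None] * n
    cluster = 0
    while cluster < max_clusters and None in labels:
        # Step 1: seed the cluster with the most populated unlabeled cell.
        unlabeled = [c for c in range(n) if labels[c] is None]
        seed = max(unlabeled, key=lambda c: sizes[c])
        labels[seed] = cluster
        total = sizes[seed]
        # Step 2: keep adding the nearest unlabeled cell while both limits hold.
        while True:
            unlabeled = [c for c in range(n) if labels[c] is None]
            if not unlabeled:
                break
            nearest = min(unlabeled, key=lambda c: D[seed][c])
            if D[seed][nearest] >= eps or total + sizes[nearest] >= gamma:
                break
            labels[nearest] = cluster
            total += sizes[nearest]
        # Step 3: move on to the next cluster.
        cluster += 1
    # Step 4: all remaining cells share one catch-all cluster.
    for c in range(n):
        if labels[c] is None:
            labels[c] = cluster
    return labels
```

Lowering `gamma` splits populated regions of the map across more letters, which is the mechanism used here to balance the letter composition.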
Finally, we determined a representative segment and assigned a structural letter for each cluster. For each cell i in a cluster, its sum of distance ($D_i$) with all of the other cells in the same cluster is equal to $D_i = \sum_{m=1}^{M} w_i w_m D_{im}$, where M is the total number of cells in a cluster, $w_i$ is the cell weight, and $D_{im}$ is the structural distance between representative segments i and m of the cells i and m, respectively. The segment with the lowest sum of distance is selected as the representative segment of this cluster. We sequentially assigned a structural letter for each cluster except J, O, and U, since these three letters are not used in BLAST. Figure 2.3 shows the distribution of these 23 clusters and the structural alphabet on 648 cells in the (κ, α) map.
Figure 2.4 shows the 3D conformation of each structural segment.
Our new NNC methods, (κ, α) map, and the structural alphabet are easily applied to build new SADB databases from known protein structure databases. We have created several SADB databases derived from PDB, a non-redundant PDB chain set (nrPDB), all domains of SCOP1.69, SCOP1.69 with <40% identity to each other, and SCOP1.69 with <95% identity to each other.
Figure 2.5 Structural alphabet substitution matrix (SASM).
2.3 Structural Alphabet Substitution Matrix (SASM)
A substitution matrix is the key component of a protein alignment method. In general, a
similar underlying mathematical structure is used to construct these matrices [44]. Here, we developed a Structural Alphabet Substitution Matrix (SASM) (Figure 2.5) by applying this mathematical structure to a structural pairing database consisting of 207,492 structural letters derived from 207,492 structural segments based on the aligned residues in the pair database.
This SASM matrix was designed to offer the preference of aligning structural segments between homologous structures that share low sequence identity. The aligned score from the SASM matrix provides structural similarity estimates and information on evolutionary distance.
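A BLOSUM-like log-odds matrix of this kind can be computed from counts of aligned letter pairs. The Python sketch below illustrates the construction defined in the following paragraph: observed pair probabilities, background probabilities, expected probabilities, and rounded log-odds scores. It is not the original code; the function name `sasm_scores` and the convention of storing each unordered pair once are assumptions, and pairs with a zero count are simply skipped here.

```python
import math
from collections import defaultdict

def sasm_scores(pair_counts, lam=1.89):
    """Log-odds substitution scores from unordered letter-pair counts.

    pair_counts maps (a, b) to the number of aligned a, b pairs, with each
    unordered pair stored once. lam is the scale factor (1.89 in this work).
    """
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background probability: p_a = q_aa + sum over b != a of q_ab / 2.
    p = defaultdict(float)
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    scores = {}
    for (a, b), q_ab in q.items():
        if q_ab == 0:
            continue  # log-odds undefined; left out of this sketch
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(lam * math.log2(q_ab / e_ab))
    return scores
```

With a toy two-letter input such as `{('A', 'A'): 6, ('A', 'B'): 2, ('B', 'B'): 2}`, identical pairs receive positive scores and the mismatched pair a negative one, because it is observed less often than the background frequencies predict.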
The entry ($S_{ij}$) of the SASM matrix, which is the substitution score for aligning a pair of structural letters i and j (1 ≤ i, j ≤ 23), is defined as

$$S_{ij} = \lambda \log_2\left(\frac{q_{ij}}{e_{ij}}\right)$$

where λ is a scale factor for the matrix, and $q_{ij}$ and $e_{ij}$ are the observed probability and the expected probability, respectively, of the occurrence of each i, j pair. The observed probability is

$$q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{23}\sum_{j=1}^{i} f_{ij}}$$

where $f_{ij}$ is the total number of letter i, j pairs in these 207,492 structural letters. The expected probability is $e_{ij} = p_i p_j$ for i = j and $e_{ij} = 2 p_i p_j$ for i ≠ j, where $p_i$ is the background probability of occurrence of letter i, given as

$$p_i = q_{ii} + \sum_{k \neq i}^{23} \frac{q_{ik}}{2}$$

The substitution score is greater than zero ($S_{ij} > 0$) if the observed probability is greater than the expected probability. By contrast, $S_{ij} < 0$ if $q_{ij} < e_{ij}$. The optimal λ value was determined by testing various values ranging from 0.1 to 5.0; λ is set to 1.89 for the best performance and efficiency. The final score $S_{ij}$ is rounded to the nearest integer value.

2.4 Evaluation of (κ, α)-Map and Structural Alphabet
The goal of creating a structural alphabet is to define the 3D structure of fragments of the protein backbone and then represent a protein structure in 3D by a series of structural letters.
A structural letter represents pattern profiles of the fragment backbones (five residues long) derived from the pair database; therefore, a protein structure of L residues is described by a structural alphabet sequence of L-4 letters. Here, we used the pair angles, κ (from 0° to 180°) and α (from –180° to 180°) as shown in Figure 2.1, to divide a 3D protein structure into a series of 3D protein fragments.
Figure 2.3 shows the accumulated (κ, α) map matrix (648 cells) of 225,523 3D segments derived from 1348 proteins in the pair database when the κ and α angles are divided by 10°.
The number of 3D segments in each cell ranges from 0 to 22,310, and the color bar on the right side shows the distribution scale. According to the definitions in DSSP, the numbers of α-helix and β-strand segments are 82,482 (36.57%) and 52,371 (23.33%), respectively. In this (κ, α) map, most of the α-helix segments are located on four cells in which the α angle ranges from 40° to 60° and the κ angle ranges from 100° to 120°. In contrast, the κ angle of most of the β-strand segments ranges from 0° to 30°, and the α angle ranges from –180° to –120° or from 160° to 180°. The number of cells having no segments is 183. We observed that most of the 3D segments in a cell have similar conformations; that is, the root-mean-square deviation (rmsd) is less than 0.3 Å on five contiguous Cα-atom coordinates. Moreover, the conformations of 3D segments located in adjacent cells are often more similar than ones in distant cells. These results indicate that the (κ, α) map matrix is useful for clustering these 3D segments and for determining a representative segment for each cluster.
Figure 2.6 The (κ, α) plots of an all-α protein (Protein Data Bank [PDB] code 1J41-A; red) and an all-β protein (PDB code 1RZF-L; blue). Axes: kappa (0° to 180°) and alpha (–180° to 180°).
Each structure has a specific (κ, α) plot (Figure 2.6) when governed by these two angles.
For instance, a typical (κ, α) plot (blue diamond) of an all-β protein (human anti-HIV-1 GP120-reactive antibody E51, PDB code 1RZF-L [45]) is significantly different from that
(red cross) of an all-α protein (human hemoglobin, PDB code 1J41-A [46]). Conversely, two similar protein structures have similar (κ, α) plots.
The (κ, α) plot is similar to a Ramachandran plot, based on the following observations.
First, the α-helices are located in very restricted areas, in which α ranges from 40° to 60°, and κ ranges from 100° to 120°. Additionally, β-sheet segments are restricted to some regions in the (κ, α) plot. All residues are fairly restricted in their possibilities in both plots. Second, angles φ and ψ in the Ramachandran plot, denoting a protein structure with a series of 3D positions of amino acids, are widely adopted to develop various structural segments (blocks).
Here, the (κ, α) plot was utilized to develop a structural alphabet, which represents a protein structure as a series of 3D protein fragments, each of which is five residues long. The angles φ and ψ represent the position relationship of two contiguous amino acids, whereas the angles κ and α represent the position relationship of five amino acids. These observations indicate that the (κ, α) plot is an effective means of both developing short sequence structure motifs and assessing the quality of a protein structure.
Figure 2.7 The three-dimensional (3D) segment conformations of the five main classes (helix, helix-like, strand, strand-like, and others) of the 23-state structural alphabet.
A set of representative segments with 23 states and its respective structural letters are identified (Figure 2.7) after performing the NNC method. Here, this 23-state structural alphabet was adopted for both protein structure reconstructions and protein structure database searches. The intra-segment structural distances (blue) are much greater than the inter-segment structural distances (Figure 2.8), and the average rmsd values of these 3D representative segments located in the same (or similar) cluster are frequently below 0.8 Å.
The composition of the 23-state structural alphabet resembles that of the 20 amino acids obtained from the pair database. The distribution of the 23-state structural segments is consistent with that of the eight-state secondary structures defined by the DSSP program.
Figure 2.8 The average intra-segment and inter-segment root mean square deviation values of the 23-state structural alphabet. Axes: structural alphabet letter (A and Y combined, then B to Z) and rmsd (Å, 0 to 2.5).
Based on the (κ, α) plot and a new nearest-neighbor clustering, a new 23-state structural alphabet was derived to represent the profiles of most 3D fragments, and was roughly categorized into five groups (Figure 2.7): helix letters (A, Y, B, C, and D), helix-like letters (G, I, and L), strand letters (E, F, and H), strand-like letters (K and N), and others. The 3D shapes of representative segments in the same category are similar; conversely, the shapes of different categories are significantly different. For instance, the shapes of representative 3D segments in the helix letters are similar to each other, as are those in the strand letters. In contrast, the shapes of helix letters and strand letters obviously differ. The average structural distance (determined from the rmsd value of five continuous Cα atom positions between a pair of 5-mer segments) of inter-segments in both helix and strand letters is less than 0.4 Å (Figure 2.8), and is much less than those of other letters in the structural alphabet. Additionally, most α-helix secondary structures based on the definition of the DSSP program are encoded as helix or helix-like letters, and none are encoded as strand or strand-like letters (Figure 2.9). Conversely, most β-strand segments are encoded as strand or strand-like letters.
Figure 2.9 The distributions of the 23-state structural alphabet on α-helix, β-strand, and the coil segments defined by the DSSP program.
All residues were fairly restricted in their possibilities in the (κ, α) plot (Figure 2.3). The proportion of cells with 0 segments, which were encoded as structural letter 'Z', was 28.2% (183 cells among 648). Additionally, the numbers of cells and segments with structural letter 'Z' were 272 (42.0%) and 989 (0.4%), respectively. Restated, only 0.44% of the segments were widely distributed in 41.98% of the cells. If the segments of a new protein structure are located in these 41.98% of cells, then they may be regarded as poor structural segments. Conversely, five helix letters (A, Y, B, C, and D) and three strand letters (E, F, and H) were located in 7 and 30 cells (Figure 2.3), respectively. The total number of segments located in these 37 (4.4%) cells was 75,477 (33.5%).
The distribution of a structural alphabet is a key determinant of speed in 3D-BLAST.
Since the structure database contained high percentages of α-helix and β-strand structures, we restricted the maximum number of structural segments in a cluster for the NNC algorithm to increase the speed of 3D-BLAST. Without this restriction, a single structural letter representing all of the α-helix segments would occupy 36.57% of the total segments. Here, the maximum number of segments per cluster was set to 16,000, which is ~7% of the total segments, in line with the distribution of the 20 amino acids. In the structural alphabet, there are 8 letters (the helix and helix-like) for the α-helix structure and 5 letters (strand and strand-like) for the β-strand structure (Figure 2.4). 3D-BLAST is ~64 times faster if the restriction is applied to the NNC method.