
METHODS AND MATERIALS

Fig. 2. Overview of our proteomic knowledge-base research (workflow boxes: 1. source protein files; 2. local databases).

Fig. 2 shows the whole data-mining process for constructing the knowledge base. In this thesis, protein information was gathered from the Protein Data Bank (PDB) and saved into local databases. Using data-preprocessing methods, I processed the protein structural information and translated it into our features, such as the phi, psi, and omega angles. In addition, the properties of the amino acids were collected in our database and joined with our protein database. After preparing these data, we designed a data-mining tool that mines the databases with a feature-selection strategy to construct the protein knowledge base.

After preparing the protein structural database, we began to design the data-mining tool and to set up our own knowledge base for proteomic research. Fig. 3 shows an overview of our methods for constructing the knowledge base. Some proteomic studies use the same framework to find descriptions of proteins and set up their own knowledge bases.

In this thesis, I demonstrate one data-mining tool whose purpose is to help researchers discover a well-represented description of protein structure, sequence, and functionality. Such a description simplifies protein structure by representing it as different building blocks, instead of using the three classical elements (helices, loops, and sheets). We can then compare these descriptions and find the conserved patterns of transformed sequences from different protein structures.

The detailed methodology is presented in Sections B and C.

Fig. 3. The detailed framework of our data-mining flow (steps: 1. source protein files; 2. local databases; 3. data preprocessing and warehousing; 4. feature extraction; 5. feature selection; 6. solving problems with the mining tool, yielding structural knowledge).

Having shown the flow for constructing our knowledge base, we discuss the detailed methodology in the following sections.

B. Data Preprocessing

First, we obtained the raw data from the Protein Data Bank, the well-known structural database. We downloaded the files to our local site and parsed them into our own database. Protein Data Bank files have their own format (see the PDB format guide, version 2.1).

Following the PDB format guide, I built an EER model for the Protein Data Bank, roughly selecting some record titles from the PDB files.

I used a Perl program to process these data and save them into a MySQL database server. Fig. 4 shows the EER model of our proteomic database. After constructing the database, we also used some data-mining techniques to extend the EER model.

Fig. 4. The EER model of our local proteomic database.

Our local database also provides some basic queries (see Fig. 4): the web site supports keyword queries, sequential PDB ID queries, and PDB ID queries for locating proteins of interest. The interface also offers visualization services (see Fig. 5), supplied by the CHIME plug-in. Script icons in the interface help users understand the properties of the protein structure. A download service is also provided, from which the .atom or .pdb file for each entry can be retrieved from our web site.

Moreover, we integrated the database, the knowledge base, and the mining tools for our research purposes; this integration is demonstrated in Chapter 5.

After preparing the database, we calculated the dihedral angles from the atomic coordinates of the protein structures and also collected the spatial information and properties of the 20 amino acids. With this protein information, we then designed the other tools and set up our own databases and knowledge base.

Fig. 4. The query interface of our proteomic database.

Fig. 5. The visualization interface of our proteomic database.

This web-based platform lets us study and practice large-scale protein database management, and it also gives protein structure researchers a web-based platform for their own work.

C. Proposed combinatorial approach: SUM-K

a. Framework of SUM-K

We then designed a mining tool to discover knowledge in our database and to construct our own knowledge base, helping researchers make sense of the data.

The workflow of the SUM-K system is shown in Fig. 6.

Fig. 6. The SUM-K workflow: a. backbone transformation into dihedral angles; b. protein fragment vector extraction as input to the SOM; c. training the SOM on the protein fragment vectors; d. visualizing the trained SOM with the U-matrix in grey levels; e. building a minimum spanning tree from the U-matrix; f. partitioning the minimum spanning tree into disconnected subtrees; g. using the number of subtrees as K and running the K-means algorithm on the input vectors; h. defining a structural alphabet based on the K-means clusters; i. transforming proteins into the structural alphabet.

We designed the SUM-K (SOM, U-matrix, MST, and K-means) approach, based on self-organizing maps and their visualization methods, to discover a whole new description of protein structure. With SUM-K we can add value-added visualization to existing databases, such as the SCOP database, and find the building blocks of the non-redundant proteins.

a. Backbone Transformation into Dihedral Angles

Fig. 7. Backbone transformation into dihedral angles.

The dihedral angles (see Fig. 7) are calculated as follows:

1) The φ angle of the i-th amino acid is the dihedral angle about its N–Cα bond, defined by the C atom of the (i−1)-th residue together with the N, Cα, and C atoms of the i-th residue.

2) The ψ angle of the i-th amino acid is the dihedral angle about its Cα–C bond, defined by the N, Cα, and C atoms of the i-th residue together with the N atom of the (i+1)-th residue.

Each residue thus contributes two dihedral angles. In this thesis, we used the φ and ψ angles as the training features for setting up the protein structural alphabet system and determined the building blocks with the SUM-K algorithm.
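As a concrete illustration, a dihedral angle defined by four backbone atoms can be computed from their coordinates with the standard four-atom vector formula. This is a minimal sketch; the function and variable names are our own, not part of the thesis toolchain:

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle (in degrees) defined by four atom positions."""
    b0 = p1 - p0
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # components of b0 and b2 perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

# phi(i) = dihedral(C(i-1), N(i), CA(i), C(i))
# psi(i) = dihedral(N(i),  CA(i), C(i),  N(i+1))
```

For a planar trans arrangement of the four atoms this returns ±180°, and for a cis arrangement 0°, matching the usual backbone-angle convention.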

b. Protein Fragment Vector Extraction as Input to the SOM

With a fixed window size of five residues, we slid the window along each all-α protein in SCOP, advancing one position in the sequence per fragment, and collected a set of overlapping 5-residue fragments. Since the relation between two successive α-carbons Cα(i) and Cα(i+1), located at the i-th and (i+1)-th positions, can be defined by the dihedral angles ψ(i) of Cα(i) and φ(i+1) of Cα(i+1), a fragment of L residues can be defined as a vector of 2(L−1) elements. Thus, in our study, each protein fragment, associated with the α-carbons Cα(i), …, Cα(i+4), is represented by a vector of eight dihedral angles, i.e. (ψ(i), φ(i+1), ψ(i+1), φ(i+2), ψ(i+2), φ(i+3), ψ(i+3), φ(i+4)). Based on this representation, we gathered 1,143,072 fragment vectors in total from the all-α class of the SCOP database. The features are shown in Fig. 8.
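The sliding-window extraction described above can be sketched as follows (a simplified stand-in for the thesis's preprocessing code; names are illustrative):

```python
import numpy as np

def fragment_vectors(phi, psi, window=5):
    """Slide a window of `window` residues along one protein and emit
    one 2*(window-1)-dimensional dihedral vector per position.

    phi[i] and psi[i] are the backbone dihedrals of residue i; the
    fragment starting at residue i yields
    (psi_i, phi_{i+1}, psi_{i+1}, ..., phi_{i+window-1})."""
    n = len(phi)
    vecs = []
    for i in range(n - window + 1):
        v = []
        for j in range(i, i + window - 1):
            v.append(psi[j])      # psi of residue j
            v.append(phi[j + 1])  # phi of residue j+1
        vecs.append(v)
    return np.array(vecs)
```

A protein of n residues produces n − 4 overlapping fragments, each an 8-dimensional vector, consistent with the 2(L−1) count above.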

Fig. 8. The dihedral angles obtained from a protein structural fragment (window size = 5).

c. Train the SOM on Protein Fragment Vectors

Our tool is based on Kohonen's self-organizing maps and uses the SOM_PAK 3.1 toolkit for our research. Kohonen's self-organizing map is an unsupervised clustering method that builds a 2D map representing the relationships among the clusters.

The SOM algorithm proceeds as follows:

1. Normalize the input data and set up a 2D map of size N × M.

2. Randomly initialize the reference vector of each node on the map.

3. Enter the first training phase. The learning rate here is higher than in the second phase; the purpose of this phase is to let each map unit memorize the features of the input vectors. At each step, the winner unit and its neighborhood are updated by the neighborhood function.

The detailed calculation is shown below:

We calculate the Euclidean distance D(i) between the input vector and each map unit, and take the nearest map unit c as the winner unit:

c = arg min_i D(i) ---(2)

The winner unit is updated toward the input vector with the learning rate, and the neighborhood of the winner is updated by the neighborhood function (formula (3)). After about 1000 iterations, the SOM converges.

The neighborhood function declines over time, and the update region shrinks accordingly; the framework of the SOM is shown in Fig. 10.

Mi(t+1) = Mi(t) + ηci(t) × (X(t) − Mi(t)) ---(3)

where ηci = α(t) × exp(−‖Ri − Rc‖² / 2σ²(t)) ---(4)

The purpose of the first training phase is to maximize the separation among the cluster centers.

4. The purpose of the second training phase is fine-tuning: it minimizes the distance between the input vectors and their cluster centers.

5. The algorithm ends after the first and second training phases.

Fig. 10. Overview of the SOM clustering algorithm (input vectors: protein structural fragments; output nodes on the 2D map).


Seven parameters were considered to affect the SOM results: the map size, first learning rate, second learning rate, first training steps, second training steps, update topology, and update (neighborhood) function.

An SOM usually consists of a regular 2D grid of so-called map units, each described by a reference vector mi = [mi1, mi2, mi3,…, mid], where d is the input vector dimension (d = 8 in our case of fragment vectors). The map units are usually arranged in a rectangular or hexagonal configuration. The number of units affects the generalization capability of the SOM, and is therefore often specified by the researcher; it can vary from a few dozen to several thousand. An SOM is a mapping from the ensemble of input data vectors (Xi = [xi1, xi2, xi3,…, xid] ∈ Rd) to the 2D array of map units. During training, data points near each other in input space are mapped onto nearby map units, preserving the topology of the input space [19][20]. The SOM is trained iteratively.
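The iterative training loop can be sketched in a few lines. This is a self-contained toy version on a rectangular grid, not the SOM toolkit used in the thesis; the parameter values and decay schedules are illustrative:

```python
import numpy as np

def train_som(data, n_rows=10, n_cols=10, steps=1000,
              lr0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM: winner search by Euclidean distance (eq. 2) and a
    Gaussian neighborhood update (eqs. 3-4) that shrinks over time."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    M = rng.uniform(data.min(), data.max(), (n_rows * n_cols, d))
    # grid coordinates R_i of each map unit
    R = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)],
                 dtype=float)
    for t in range(steps):
        frac = t / steps
        lr = lr0 * (1 - frac)              # decaying learning rate alpha(t)
        sigma = sigma0 * (1 - frac) + 0.5  # shrinking neighborhood sigma(t)
        x = data[rng.integers(len(data))]
        c = np.argmin(((M - x) ** 2).sum(axis=1))        # winner unit
        h = np.exp(-((R - R[c]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        M += (lr * h)[:, None] * (x - M)                 # neighborhood update
    return M.reshape(n_rows, n_cols, d)
```

In practice the two-phase schedule from the algorithm above corresponds to running this loop twice: once with a large learning rate and neighborhood, then again with small values for fine-tuning.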

d. Visualizing trained SOM with U-Matrix in Grey Levels

Fig. 11. The calculation of the U-matrix (using a 4 × 4 SOM map as an example).

The unified distance matrix (U-matrix) is widely used for SOM visualization and reflects the clustering result of the SOM. The U-matrix is a distance matrix that shows the difference between neighboring map units on the SOM map. Fig. 11 shows an example of calculating the U-matrix on a 4 × 4 SOM map with hexagonal topology. Take the point A at (2,2): its six neighboring map units are (1,1), (1,2), (2,1), (2,3), (3,1), and (3,2). Every node on the SOM map has six neighbors, except the marginal units, and we connect each target point with its neighbors to construct the connected components for the minimum spanning tree. Finally, we construct the U-matrix in which each map unit is connected to its neighboring units (see Fig. 11), and each connection carries the distance between its two nodes. The connected components are then produced from the U-matrix (see Fig. 12).
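A U-matrix of this kind can be computed as below. This sketch assumes the hexagonal neighbour offsets from the (2,2) example above apply uniformly; the handling of grid edges is our simplification:

```python
import numpy as np

# neighbour offsets matching the example: A=(2,2) touches
# (1,1), (1,2), (2,1), (2,3), (3,1), (3,2)
HEX_OFFSETS = [(-1, -1), (-1, 0), (0, -1), (0, 1), (1, -1), (1, 0)]

def u_matrix(M):
    """For each map unit, average Euclidean distance to its (up to six)
    hexagonal neighbours; M has shape (rows, cols, dim)."""
    rows, cols, _ = M.shape
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in HEX_OFFSETS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(np.linalg.norm(M[r, c] - M[nr, nc]))
            U[r, c] = np.mean(dists)
    return U

def to_grey(U):
    """Normalise U-matrix values into 0-255 grey levels."""
    return ((U - U.min()) / (U.max() - U.min()) * 255).astype(int)
```

High grey levels mark long inter-unit distances, i.e. cluster borders, which is what the later grey-level threshold exploits.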

Figure 12. The connected map units of the U-matrix.

To display the clustering result, we normalized the distance between each pair of neighboring map units into 256 grey levels. The result is shown below; there are apparently six clusters on the SOM map. We preferred to recognize the number of clusters by computer rather than by eye, so we decided to use the MST algorithm with a chosen grey level to determine the number of clusters.

Figure 13. The U-matrix visualization with 0-255 grey levels.

f. Partition the Minimum Spanning Tree into Disconnected Subtrees

After the above steps, the connected units of the SOM map have been generated. We used them as the input to Kruskal's MST algorithm, which produces the minimum spanning tree over the connected units of the SOM map. The result is shown in the following figure.

Figure 14. Formation of Minimal Spanning Tree.

Fig. 14 shows the complete MST over the SOM map units. However, the number of clusters still has to be determined. We defined a grey-level threshold of 47 and cut the MST into several subtrees. After cutting, the result shown in Fig. 15 is obtained. Still, some small subtrees remain that are not large enough to form a cluster on the SOM map.

Figure 15. Partitioning the MST into subtrees at grey level 47.

To remove these clusters, subtrees whose size is smaller than (map size)^(1/4) are excluded, yielding the final pruned MST (see Fig. 16).

Fig. 16. Cleaning the MST by removing subtrees of size < (map size)^(1/4).

With the FIND-SET algorithm, the root of each subtree and the number of subtrees can be identified. The number of subtrees corresponds to the number of clusters on the SOM map, and this is the step that produces it. With this visualization of the SOM through the U-matrix and MST (SUM), the number of clusters to feed into K-means is obtained (see Fig. 17).
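The cut-and-count step can be sketched with a union-find structure. Since the components obtained by cutting MST edges above a threshold coincide with the connected components of the thresholded graph (the single-linkage property), Kruskal's scan can simply stop at the threshold. Names and the encoding of the pruning rule are ours:

```python
def find(parent, x):
    """FIND-SET with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def count_clusters(n_nodes, edges, grey_threshold, min_size):
    """Kruskal-style scan over (weight, u, v) edges in ascending order;
    edges above the grey-level threshold are never joined, which is
    equivalent to building the full MST and then cutting them.
    Subtrees smaller than min_size are discarded; returns K."""
    parent = list(range(n_nodes))
    for w, u, v in sorted(edges):
        if w > grey_threshold:
            break                      # all remaining edges would be cut
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv            # union: this edge joins two subtrees
    sizes = {}
    for x in range(n_nodes):
        root = find(parent, x)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for s in sizes.values() if s >= min_size)
```

Here `min_size` plays the role of the (map size)^(1/4) pruning rule above.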

Fig. 17. Determining the number of clusters on the SOM map (six subtrees, numbered 1-6).

g. Use the Number of Subtrees as K and Run the K-means Algorithm on the Input Vectors

The map size was then expanded from 10 × 10 to 260 × 260 with SUM until the most frequent number of clusters emerged; this most frequent number is K. We take it as the K parameter of the K-means algorithm and, having determined it, run K-means with this K on the input vectors.

Because different random seeds lead K-means to different cluster centers, we ran it repeatedly, seeking centers that are well separated from one another while minimizing the distance between the input vectors and their centers, and kept the best cluster centers as the final ones.
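This restart strategy can be sketched as follows, keeping the run with the smallest within-cluster sum of squares. It is a plain Lloyd's-iteration sketch, not the thesis's actual implementation:

```python
import numpy as np

def kmeans(X, k, seed, iters=100):
    """Plain Lloyd's iteration from one random initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)        # nearest-center assignment
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break                        # converged
        centers = new
    inertia = ((X - centers[labels]) ** 2).sum()
    return centers, labels, inertia

def best_kmeans(X, k, n_seeds=10):
    """Restart K-means from several seeds; keep the run with the
    smallest within-cluster sum of squares (inertia)."""
    return min((kmeans(X, k, s) for s in range(n_seeds)),
               key=lambda run: run[2])
```

Minimizing inertia across restarts captures the "minimize distance to centers" criterion; well-separated centers fall out of the same objective on clustered data.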

h. Define Structural Alphabet based on K-means Clusters

Having found the best cluster centers, we assign a letter of the alphabet to each K-means center. We then reassign each input vector to its closest center, i.e. the center minimizing the Euclidean distance to the vector, and give the vector that center's letter. Through these steps, we can transform a raw amino-acid sequence into a structural-alphabet sequence.

i. Transform Proteins into Structural Alphabet

Fig. 18. Transforming a protein's raw sequence into our own structural alphabet (e.g., the fragment YFDYVAGALA maps to the structural-alphabet string aaaaaaaaaq).

After reassigning each amino acid to a K-means cluster center, the transformed sequence of every protein domain in the all-α class of the SCOP database was produced. With these sequences, we can carry out comparative studies of different alphabet systems and align protein structures through their 1D structural-alphabet sequences.
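The final transformation step amounts to a nearest-center lookup for each fragment vector. A sketch; the letter ordering (center 0 → 'a', center 1 → 'b', …) is our assumption:

```python
import string
import numpy as np

def to_structural_alphabet(vectors, centers):
    """Assign each fragment vector to its nearest K-means center and
    emit the corresponding structural-alphabet letter."""
    letters = string.ascii_lowercase
    # squared Euclidean distance of every vector to every center
    d = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return "".join(letters[i] for i in d.argmin(axis=1))
```

Applying this to each protein's fragment vectors yields 1D strings like the one shown in Fig. 18, which can then be compared with standard sequence-alignment tools.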
