
Pei-Ken Chang, Chien-Cheng Chen and Ming Ouhyoung

2. Previous Work

In general, structure alignment based on 3D structure has been shown to be NP-complete by

alignment are needed. Fisher et al. [13] used geometric hashing for a Cα-only representation of protein structure, and a follow-up is described in Tsai et al. [14]. Their method is based on preprocessing and recognition algorithms of complexity $O(n^3)$, where n is the number of residues of interest. Later, Pennec and Ayache [15] [16] introduced a 3D reference frame attached to each residue, which reduces the complexity of recognition to $O(n^2)$. Shindyalov and Bourne [17] proposed a method that involves a combinatorial extension (CE) of an alignment path defined by aligned fragment pairs (AFPs), rather than the more conventional techniques that use dynamic programming and Monte Carlo optimization. Combinations of AFPs that represent possible continuous alignment paths are selectively extended or discarded, thereby leading to a single optimal alignment.

Zemla [18] proposed the LGA (local-global alignment) algorithm, in which the longest continuous segment is found first, and then a second step called GDT (global distance test) is applied. Both the longest segment of residues under a selected RMSD (root mean square deviation) cutoff and the largest set of equivalent residues that deviate by less than a given distance threshold are obtained. Blankenbecler et al. [19] proposed using fuzzy alignment variables and iterative minimization of a cost function. Milik et al. [20] used graph matching, representing atoms as nodes and bond distances as edge labels. Their search method is based on comparing local structural features of proteins that share a common biochemical function, and so does not depend on the overall similarity of the structures and sequences of the compared proteins.

From the above survey, it is clear that all of these papers are concerned with proteins, and that they reduce the complexity of alignment by exploiting features of proteins or segments of aligned one-dimensional sequences. Therefore, they cannot solve the general molecule alignment problem unless the tools are modified.

3. Algorithms

In this paper, we propose a tool to align two molecules based on their 3D structural data. The alignment problem between two molecules A and B is solved in two steps: geometric hashing and a fine tuning process. Geometric hashing globally finds an initial matching of approximately overlapped atoms.

Thus, parts of molecule A can be matched to parts of molecule B. Secondly, the fine tuning process is based

number of overlapped atoms within a given distance threshold cannot be increased any further.

3.1. Geometric Hashing: Step One

The geometric hashing algorithm is used to structurally align two molecules. Geometric hashing is a technique originally developed in computer vision for object recognition, and it can easily be parallelized [21] [22]. In short, the algorithm is composed of two stages: preprocessing and recognition. The basic idea is to store in a database, at preprocessing time, a redundant representation of the models under rigid transformations. As a result, the representation of the query object computed at recognition time will resemble that of some of the database models. Matching is possible even when the recognizable database objects have undergone transformations or when only partial information is present.

Often the two molecules of interest are both proteins, so we illustrate the solution for that situation first. In some cases, e.g. molecular mimicry, the two molecules belong to different types; the calculation then differs slightly, as described later.

The three atoms N, Cα and C in each amino acid form a triangle which uniquely defines the position and orientation of the amino acid in the three-dimensional structure of a protein, since the N−Cα and Cα−C bond lengths are fixed and the N−Cα−C bond angle is also constant. For alignment, the correspondence between two triplets of non-collinear points in three-dimensional space is sufficient to uniquely determine a rigid transformation. We can therefore use a single residue as a basis. A basis is calculated by the following steps, as illustrated in Figure 1(a).

1. Normalize $\overrightarrow{NC_\alpha}$ to $\vec{e}_1$
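The construction of a residue basis can be sketched in Python as follows. Only the first step of the list survives in this text, so everything after the normalization of the N→Cα vector is an assumption: a standard Gram-Schmidt construction, with the origin placed at Cα. The function names `residue_frame` and `to_local` are hypothetical and used only for illustration.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def residue_frame(n, ca, c):
    """Orthonormal frame from the N, CA and C atom positions of one residue."""
    e1 = normalize(sub(ca, n))                      # step 1 from the text
    v = sub(c, ca)
    # Assumed steps: remove the e1 component of CA->C, then take a cross product.
    e2 = normalize(sub(v, tuple(dot(v, e1) * x for x in e1)))
    e3 = cross(e1, e2)
    return ca, (e1, e2, e3)                         # assumed origin: CA

def to_local(frame, p):
    """Coordinates of point p expressed in the given frame."""
    origin, axes = frame
    d = sub(p, origin)
    return tuple(dot(d, e) for e in axes)
```

Coordinates produced by `to_local` do not change when the whole molecule is rotated or translated, which is the property the hash table relies on.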

The geometric hashing algorithm has two phases, preprocessing and recognition. To solve the problem of representation in different reference coordinates, coordinate information based on the different reference frames of a model is encoded in the preprocessing phase and stored in a large memory, in this case a hash table. The contents of the hash table

offline to reduce the time needed for recognition.

Access to the memory is based on geometric information that is invariant to the object’s pose and is computed directly from the scene. During the recognition phase, the method accesses the previously constructed hash table using indices derived from the encoded coordinate information of the input object and finds their common spatial features.

In the preprocessing phase, we calculate one basis for each residue and use it to generate coordinates for each atom in the protein. In the recognition phase, we choose a reference frame of protein B. For each reference frame of protein A in the hash table, we accumulate the number of matched atoms by checking whether two atoms are close enough.

We set a threshold distance MatchThres (a value of 1 to 2 Å is appropriate), beyond which atoms are not considered a match. If no atom can be matched within MatchThres, the score is 0; if there is an atom within MatchThres, the score is 1.

The process is repeated with each reference frame of the protein B until all the reference frames of these two proteins have been tested.
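The two phases can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation used in this paper: each "view" is assumed to be the list of atom coordinates already expressed in one reference frame, the hash key is a grid cell of side MatchThres, and the names `build_table`, `vote` and `best_pair` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def cell_of(p, size):
    """Grid cell containing point p; used as the hash key."""
    return tuple(math.floor(c / size) for c in p)

def build_table(views_a, size):
    """Preprocessing: store every atom of every view of A under its cell."""
    table = defaultdict(list)
    for view_id, atoms in enumerate(views_a):
        for p in atoms:
            table[cell_of(p, size)].append((view_id, p))
    return table

def vote(table, view_b, match_thres):
    """Recognition for one view of B: score 1 per atom of B that has an
    atom of a given view of A within match_thres, otherwise 0."""
    votes = defaultdict(int)
    for p in view_b:
        cx, cy, cz = cell_of(p, match_thres)
        hit = set()
        # A match within match_thres can only lie in the 27 surrounding cells.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for view_id, q in table.get((cx + dx, cy + dy, cz + dz), ()):
                if math.dist(p, q) <= match_thres:
                    hit.add(view_id)
        for view_id in hit:
            votes[view_id] += 1
    return votes

def best_pair(views_a, views_b, match_thres=1.5):
    """Return (score, index of A view, index of B view) with the most matches."""
    table = build_table(views_a, match_thres)
    best = (0, None, None)
    for b_id, view_b in enumerate(views_b):
        for a_id, score in vote(table, view_b, match_thres).items():
            if score > best[0]:
                best = (score, a_id, b_id)
    return best
```

Because the table is built once for molecule A, each reference frame of molecule B only costs a lookup per atom rather than a comparison against every atom of A.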

In the case of aligning two different kinds of molecules, the algorithm is slightly modified when creating the bases. For each atom with coordinate P, select two atoms bonded to it, with coordinates Q1 and Q2 respectively. The rule for constructing the basis is

1. Normalize $\overrightarrow{PQ_1}$ to $\vec{e}_1$

and is illustrated in Figure 1(b). The origin of the new coordinate frame is P. If an atom is bonded to n atoms, n × (n − 1) coordinate frames would be constructed for this atom. The number of coordinate frames then becomes so large that execution is inefficient. To decrease the execution time, the criteria for selecting atoms to create bases are listed in Table 1. Accordingly, we calculate two bases for each residue and four bases for each nucleotide. In proteins, the “CA” atom lies on the backbone and carries the side chain, and the “CB” atom is the attached side-chain atom. In nucleic acids, the “C4*” and “C3*” atoms both occupy positions similar to that of the “CA” atom in proteins, and the “O4*” and “C2*” atoms occupy positions similar to that of the “CB” atom. This is illustrated in Figure 2.
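The growth in the number of frames is easy to see by enumerating the ordered pairs of bonded atoms. The short Python sketch below assumes a hypothetical adjacency map `bonds` from an atom name to the names of the atoms bonded to it; the function name `candidate_frames` is likewise only illustrative.

```python
from itertools import permutations

def candidate_frames(bonds):
    """List every (P, Q1, Q2) triple: P is the origin atom and (Q1, Q2) is an
    ordered pair of distinct atoms bonded to P, giving n * (n - 1) per atom."""
    frames = []
    for p, neighbours in bonds.items():
        for q1, q2 in permutations(neighbours, 2):
            frames.append((p, q1, q2))
    return frames

# Hypothetical fragment: an atom with four bonded neighbours yields 4 * 3 = 12 frames.
bonds = {"CA": ["N", "C", "CB", "HA"], "N": ["CA"]}
frames = candidate_frames(bonds)
```

Restricting the origin and neighbour atoms to a fixed short list, as Table 1 does, replaces this full enumeration with a constant number of frames per residue or nucleotide.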

Table 1. The rule for selecting atoms to construct coordinate frames.

Figure 1. Calculation of a basis. (a) The protein structure. (b) The general molecule structure.

Figure 2. A sketch of molecules to explain the rule for coordinate frame construction. (a) Amino acid. (b) Nucleotide.

3.2. Fine Tuning Process: Step Two

Once geometric hashing, the global optimization step, has produced an approximate alignment, a fine tuning process based on local optimization of the overlapped parts follows. This step is necessary because the 3D structural data in PDB always contain some error from determining atom positions by X-ray crystallography. Furthermore, geometric hashing provides only an initial alignment. The alignment therefore needs fine tuning, for which the Iterative Closest Point (ICP) algorithm [23] [24] is chosen. As illustrated in Figure 3, the ICP algorithm is applied repeatedly until the number of overlapped atoms within a given distance threshold can no longer be increased.

The ICP algorithm addresses the following key registration problem: given two three-dimensional shapes, estimate the optimal translation and rotation that register the two shapes by minimizing the mean square distance between them. The algorithm guarantees that a local minimum of a mean square objective function is found [23]. In our implementation, we select the 100 rigid transformations that lead to the largest numbers of overlapped pairs. The results show that ICP indeed increases the number of atoms matched.
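The fine tuning loop can be sketched in pure Python as follows. This is an illustration under assumptions, not the implementation described here: correspondences are found by brute-force nearest neighbour search, the rotation is obtained from a unit quaternion whose dominant eigenvector is approximated by a fixed number of power iterations, and all function names are hypothetical.

```python
import math

def centroid(pts):
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(3))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def rotate(r, v):
    return tuple(sum(r[i][j] * v[j] for j in range(3)) for i in range(3))

def best_rotation(src, dst):
    """Rotation minimizing sum |R*src_i - dst_i|^2 for paired point lists."""
    s = [[sum(p[a] * q[b] for p, q in zip(src, dst)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    n = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shift so all eigenvalues are non-negative, then run power iteration.
    shift = max(sum(abs(v) for v in row) for row in n) + 1e-12
    q = [1.0, 0.5, 0.25, 0.125]          # arbitrary non-zero start vector
    for _ in range(500):
        q = [shift * q[i] + sum(n[i][j] * q[j] for j in range(4))
             for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    w, x, y, z = q                       # unit quaternion -> rotation matrix
    return [[w*w + x*x - y*y - z*z, 2*(x*y - w*z), 2*(x*z + w*y)],
            [2*(x*y + w*z), w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
            [2*(x*z - w*y), 2*(y*z + w*x), w*w - x*x - y*y + z*z]]

def icp_step(moving, fixed):
    """Pair each moving atom with its closest fixed atom, then re-register."""
    dst = [min(fixed, key=lambda f: math.dist(p, f)) for p in moving]
    cs, cd = centroid(moving), centroid(dst)
    r = best_rotation([sub(p, cs) for p in moving], [sub(d, cd) for d in dst])
    return [tuple(a + b for a, b in zip(rotate(r, sub(p, cs)), cd))
            for p in moving]

def count_overlaps(moving, fixed, thres):
    return sum(1 for p in moving
               if min(math.dist(p, f) for f in fixed) <= thres)

def fine_tune(moving, fixed, thres=1.0, max_rounds=50):
    """Repeat ICP steps while the number of overlapped atoms keeps increasing."""
    best, best_count = moving, count_overlaps(moving, fixed, thres)
    for _ in range(max_rounds):
        candidate = icp_step(best, fixed)
        count = count_overlaps(candidate, fixed, thres)
        if count <= best_count:
            break
        best, best_count = candidate, count
    return best, best_count
```

The stopping rule in `fine_tune` follows the criterion stated above: iteration ends as soon as a further ICP step no longer increases the number of atoms within the distance threshold.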

Figure 3. The flow chart for fine tuning process.
