
Chung Hua University Master's Thesis

Title: Parallel Branch and Bound Approach with MPI Technology in Inferring Chemical Compounds with Path Frequency
(以平行分支界定演算法及 MPI 進行化合物推論)

Department: Master Program, Department of Computer Science and Information Engineering
Student: M09602046 王暉元
Advisor: Dr. 游坤明

February 2010


Abstract (Chinese)

When designing new drugs, appropriately classifying the types and characteristics of drugs is very important, and the structural relationships of chemical compounds are a key part of drug design. Among the biochemical methods concerned with compound structure, objects are mapped by a kernel method to feature vectors in a feature space. The kernel method is a highly effective technique and is widely used in pattern recognition. In kernel methods, a set of objects or data in the target problem is mapped into a space called the feature space; when an object is assigned to a class, it is defined as a feature vector in that feature space. Given a point already mapped into the feature space, we map it back to an object. In this study, path frequency is used as the feature vector. As compounds become more complex, the time needed to grow a graph from its path frequency rises sharply. To solve this problem, following previous related work, we propose a parallel branch-and-bound algorithm with MPI to reduce computation time: the work is partitioned and distributed appropriately to different cores for processing, lowering the overall computation time. The finished results are also converted to SMILES format to filter out duplicate results.

Keywords: chemical compound inference, parallel branch-and-bound algorithm, SMILES, MPI


Abstract

Drug design is the approach of finding drugs by using computational tools. When designing a new drug, the structure of the drug molecule can be modeled by classifying potential chemical compounds. Kernel methods have been used successfully to classify potential chemical compounds, and frequencies of labeled paths have been proposed to map compounds into feature space in order to classify the characteristics of target compounds. In this study, we propose an algorithm based on the kernel method that uses parallel computing technology to reduce computation time. This reduced time constraint allows us to aim at backtracking a full scheme of all possible pre-images, regardless of their differences in molecular structure, provided they share the same feature vector. Our method modifies BB-CIPF and uses MPI to reduce the computation time. We can filter out duplicate results by converting them into SMILES form. The experimental results show that our algorithms reduce the computation time effectively for the chemical compound inference problem.

Keywords: chemical compound inference, parallel branch-and-bound, SMILES, MPI


Acknowledgements

This thesis could not have been completed without the guidance and teaching of my advisor, Dr. 游坤明, who carefully corrected and attended to my research direction, concepts, thesis structure, writing, and attitude toward study; I offer him my deepest respect and gratitude. During the oral defense, the committee members' comments made this thesis more complete, for which I sincerely thank them.

For the shared daily life in the laboratory and the academic discussions, I thank my fellow students, seniors, and juniors. To everyone who has helped and cared for me, I offer my heartfelt thanks.

Finally, I dedicate this thesis to my beloved parents.


Table of Contents

CHAPTER1. INTRODUCTION ... 1

CHAPTER2. RELATED WORK ... 5

2.1 The BB-CIPF algorithm ... 5

2.2 Message Passing Interface ... 8

2.3 SMILES ... 10

CHAPTER3. RELATED TOOLKIT ... 12

3.1 Parallel computing model ... 12

3.2 MPI ... 14

3.3 MPI process model ... 14

3.4 MPI functions... 15

CHAPTER4. PARALLEL-BB-CIPF (PB-CIPF) ... 16

4.1 Overview of PB-CIPF ... 16

4.2 BFS approach ... 19

4.3 DFS approach... 22

4.4 The procedure of PB-CIPF ... 27

4.5 Encoding SMILES form ... 29

4.6 Modified PB-CIPF ... 30

4.6.1 Isomorphic problem ... 30

4.6.2 Job scheduling ... 31

CHAPTER5. EXPERIMENT ... 33

5.1 Experimental environments ... 33

5.2 Experimental result ... 33

5.2.1 PB-CIPF ... 33

5.2.2 PB-CIPF-M ... 39


CHAPTER6. CONCLUSION ... 42

References ... 43


List of Figures

Figure 2-1: Inferring a Chemical Structure from a Feature Vector ... 6

Figure 2-2: An illustration of a multitree G and its feature vector ... 7

Figure 2-3: An example of SMILES form: CC(C)C(=O)O ... 10

Figure 3-1: Basic concept of parallel computing ... 12

Figure 3-2: A basic concept of master and slaves approach ... 13

Figure 3-3: MPI programming model ... 14

Figure 4-1: Inferring all possible c’=φ (g) of a graph from a feature vector. ... 17

Figure 4-2 : Procedure diagram of PB-CIPF ... 18

Figure 4-3: flow chart of BFS ... 20

Figure 4-4: example of BFS processing... 21

Figure 4-5: An example of BFS stage of PB-CIPF ... 22

Figure 4-6: flow chart of DFS ... 24

Figure 4-7: inserting atoms ... 25

Figure 4-8: feature vector of result compound will be same as target ... 25

Figure 4-9: An example of DFS stage of PB-CIPF... 26

Figure 4-10: a sample compound ... 30

Figure 4-11: a sample of the approach ... 31

Figure 4-12: a sample of the job scheduling ... 32

Figure 5-1: Makespan of C11108 ... 35

Figure 5-2: Makespan of C00097 ... 35

Figure 5-3: Makespan of C11109 ... 36

Figure 5-4: Makespan of C15987 ... 36

Figure 5-5: Makespan of C15987 with K=1 ... 37

Figure 5-6: Speedup ratio of 2 nodes ... 38

Figure 5-7: Speedup ratio of 4 nodes ... 39

Figure 5-8: Makespan of C11108 ... 40

Figure 5-9: Makespan of C11109 ... 40

Figure 5-10: Makespan of C15987 ... 41

Figure 5-11: Makespan of C00097 ... 41


List of Tables

Table 3-1: common functions of MPI ... 15

Table 5-1: size of compounds with and without H atoms ... 34


CHAPTER1. INTRODUCTION

Drug design is a valuable topic in chemogenomics [15]. To characterize drugs appropriately, classification is important when designing a new drug. In drug lead identification and optimization, computational methods are applied to chemical problems, particularly the manipulation of chemical structure information, with the aim of supporting better decisions. Increases in computer power, particularly for desktop machines, have provided the resources to deal with the resulting deluge of data. Many studies of drug discovery make use of such techniques [18]: designing new synthetic routes by searching databases of known reactions, constructing computational models such as quantitative structure-activity relationships that relate observed biological activity to chemical structure, and using molecular docking programs to predict three-dimensional structures in order to select sets of compounds for screening. Recent chemical developments for drug discovery, namely combinatorial chemistry and high-throughput computing, are generating large amounts of chemical data [18]. This has created a demand to collect, organize, and apply this information effectively.

Quantitative structure-activity relationships have been used by many researchers to classify compounds. Support Vector Machines (SVMs) [11, 12] and other kernel methods [7-9, 13, 16, 19] have been widely used in various classification problems in chemogenomics.

Applying a kernel method usually requires developing a mapping from the set of objects in the target problem to a feature space; a kernel function is then defined as an inner product between two feature vectors. An object is thereby represented as a feature vector in the feature space, and SVMs can be employed to learn the classification rules. Feature vectors based on the frequency of labeled paths [16, 19] or the frequency of small fragments [9, 13] have been used successfully.

In this thesis, we are interested in finding different compounds that have the same characteristics. We use feature vectors to represent the characteristics of compounds and then search for different compounds sharing the same feature vector. To find the results, we adopted a branch-and-bound algorithm. The branch-and-bound search tree grows rapidly with problem size, but it can be explored more quickly by several processes: faster exploration allows more nodes to be pruned, or more branches of the tree to be bounded, during the search. A branch-and-bound algorithm is therefore well suited to parallel computing.

Thus we use parallel computing to help find the results. Parallel computing uses multiple processing elements simultaneously to solve a problem: the problem is broken into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others.

A desired object is computed as a point in the feature space using a suitable function, and the point is then mapped back to the input space; this mapped-back object is called a pre-image.

Let φ be a function mapping an input space G to a feature space F. The pre-image problem [6] is, given a point y in F, to find x in G such that y = φ(x); such an x is called a pre-image of y.


For example, suppose we want to infer a graph from the numbers of occurrences of vertex-labeled paths [3, 4]. In [3], a feature vector g is a multiset of label strings of length at most K, representing the path frequency. Given a feature vector g, the problem is to find a vertex-labeled graph G that attains a one-to-one correspondence between g and the set of label sequences along all paths of length at most K in G.

In this thesis, we consider compound structures of larger size (fewer than 20 carbon atoms). A larger compound size requires more time to infer a pre-image from the path frequency g.

Parallel computing provides more computing resources than a single processor [22]. In science and engineering, some applications (complex challenge problems) are computationally bound. In new application areas as well, large amounts of computation can be put to profitable use through parallel computing, for example in data mining (extracting consumer spending patterns from customer data) and optimization (just-in-time retail delivery).

Processing computations in parallel is natural and intuitive, because the real world is itself parallel. Even if processor speed continues to improve according to Moore's Law, parallel computation is still likely to be more cost-effective. This is largely due to the design costs of each new generation of processors, for which applying newer technologies, such as optical computation, remains expensive.

Parallel computing has allowed complex problems to be solved and high-performance applications to be implemented in science and engineering, as well as in new application areas [22].

We want to reduce the computing time. We developed a parallel computing method by modifying the algorithm published by Akutsu and Fukagawa [5]. The main concept is to assign independent tasks to different computing nodes, with the expectation of reducing computing time.

We used the Message Passing Interface (MPI) to implement parallel computing. MPI is a language-independent communications protocol used to program parallel computers; both point-to-point and collective communication are supported. MPI is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation.

After obtaining results, we convert them to SMILES form. SMILES (Simplified Molecular Input Line Entry System) is a chemical notation system based on principles of molecular graph theory. SMILES allows rigorous structure specification through a very small and natural grammar, describing the structure of chemical molecules unambiguously with short ASCII strings. A canonical SMILES string is unique for each structure [25], so we can filter out duplicate results by converting them into SMILES form.

The rest of this thesis is organized as follows. Chapter two introduces the background and related work. Chapter three introduces the technologies used in this thesis. Chapter four proposes and describes our parallel method. Chapter five presents the experimental results. Finally, chapter six concludes this thesis.


CHAPTER2. RELATED WORK

2.1 The BB-CIPF algorithm

Compound characteristics such as the frequency of labeled paths [16, 19] or the frequency of small fragments [9, 13] are used by some researchers to classify compounds. Extending from inferring a tree from its walks [20] and from the graph reconstruction problem [17], Kernel Principal Component Analysis and Regression [9] and stochastic search algorithms [13] have been used to find pre-images. However, the results and performance of these algorithms were not thoroughly verified against more complex compounds.

Figure 2-1 shows that, given a target x, a kernel method computes φ(x), from which the compound is inferred. To infer a chemical structure from a feature vector based on the frequency of labeled paths and small fragments, Branch-and-Bound Chemical compound Inference from Path Frequency (BB-CIPF) [5] is used; it infers chemical compounds of tree-like structure. The BB-CIPF algorithm extends [3, 4]: chemical compounds are assigned a feature vector based on the frequency of small fragments, and compounds of moderate size (i.e., fewer than 20 carbon atoms) are inferred.


Figure 2-1: Inferring a Chemical Structure from a Feature Vector

The pre-image problem is defined as follows. Let Σ be an alphabet and let ΣK be the set of label strings induced by paths of length at most K over Σ. For a string t and a graph G, occ(t, G) denotes the number of occurrences of t as a label path in G. Then, the feature vector fK(G) of level K for G is the integer vector whose coordinate indexed by t ∈ ΣK is occ(t, G); that is, fK(G) = (occ(t, G)). For example, consider the compound C2H4O2 (see figure 2-2) over Σ = {C, O, H} with K = 1. Then fK(G) = (2, 2, 4, 2, 3, 2, 2, 1, 3, 1, 0), listing occ(t, G) for t = C, O, H, CO, CH, OC, CC, OH, HC, HO, HH, because occ(C, G) = 2, occ(O, G) = 2, occ(H, G) = 4, occ(CC, G) = 2, occ(CO, G) = 2, and so on. If K is large, the number of dimensions of a feature vector is large (exponential in K).
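The definition above can be sketched in code. The following Python sketch (the function name and graph representation are ours, not the thesis's) enumerates simple label paths of length at most K edges and counts occ(t, G); for K = 1 it reproduces the counts of figure 2-2 for C2H4O2.

```python
def path_frequency(adj, labels, K):
    """Count labeled paths of length at most K edges in a vertex-labeled graph.

    adj    : dict vertex -> list of neighbour vertices
    labels : dict vertex -> atom label, e.g. 'C', 'O', 'H'
    Returns a dict mapping label strings t to occurrence counts occ(t, G).
    """
    freq = {}

    def walk(v, string, visited, depth):
        freq[string] = freq.get(string, 0) + 1
        if depth == K:
            return
        for w in adj[v]:
            if w not in visited:  # extend only along simple paths
                walk(w, string + labels[w], visited | {w}, depth + 1)

    for v in adj:  # every vertex can start a path
        walk(v, labels[v], {v}, 0)
    return freq

# acetic acid C2H4O2: 0 = CH3 carbon, 1-3 = its H atoms,
# 4 = carboxyl carbon, 5 = double-bonded O, 6 = hydroxyl O, 7 = its H
labels = {0: 'C', 1: 'H', 2: 'H', 3: 'H', 4: 'C', 5: 'O', 6: 'O', 7: 'H'}
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0],
       4: [0, 5, 6], 5: [4], 6: [4, 7], 7: [6]}
f1 = path_frequency(adj, labels, 1)
```

Running this gives f1['C'] = 2, f1['CH'] = 3, f1['OH'] = 1, and so on, matching figure 2-2.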


C   O   H   CO  CH  OC  CC  OH  HC  HO  HH
2   2   4   2   3   2   2   1   3   1   0

Figure 2-2: An illustration of a multitree G and its feature vector

Given a target compound Ttarget, the current tree Tcur is expected to be grown toward Ttarget; inserting a node n into Tcur yields Tnext. Let ftarget be the given feature vector for which a pre-image should be computed. Since information on paths of length 0 is included in ftarget, the size n of the target chemical compound is known, as are the numbers of occurrences of the atom types in the pre-image of ftarget. Let ftarget be the feature vector of Ttarget and fnext the feature vector of Tnext. If fnext does not match ftarget, Tnext is discarded and not expanded further; Tcur may instead be re-extended with another node and compared with Ttarget again.

The concept of branch-and-bound chemical compound inference from path frequency is to infer tree-like structures of chemical compounds, backtracking over the full scheme of all possible pre-images, regardless of their differences in molecular structure, provided the compounds share the same feature vector; this brings the algorithms closer to the practical needs of drug design.

The BB-CIPF algorithm tracks back pre-images as partial solutions. For example, given a target compound, if three objects have the same feature vector, then those three are the partial results. The BB-CIPF algorithm can be modified to infer more general classes of chemical compounds or to use feature vectors based on the frequency of small fragments. Several definitions used in BB-CIPF follow. Let atomset(f) be the multiset of atom types in the pre-image of a feature vector f. Let ATOMBONDPAIRS be a set of possible atom-bond pairs. For example, if only C, N, O, H are considered and aromatic bonds are not considered, it is defined as

ATOMBONDPAIRS = {(C, 1), (C, 2), (C, 3), (N, 1), (N, 2), (N, 3), (O, 1), (O, 2), (H, 1)}.

It should be noted that (C, 4) is not included since it is not necessary to consider a compound consisting of only two carbon atoms.

The basic idea of BB-CIPF is that, beginning from a small tree, leaves are added one by one. The BB-CIPF algorithm maintains trees; moreover, it does not employ dynamic programming but instead a depth-first search procedure.

When adding a leaf u, the BB-CIPF algorithm examines essentially all combinations of an atom-bond pair (A, B) from ATOMBONDPAIRS and a vertex w in the current tree. However, there is no need to examine the following cases, where Tcur is the current tree and Tnext is the next candidate tree obtained by adding a leaf to Tcur:

(i) Addition of a leaf with atom label A violates the condition on the numbers of atom occurrences.

(ii) Connection of a leaf to w ∈ Tcur by bond type B violates the condition on the valence of w.

(iii) Connection of a leaf to w ∈ Tcur violates the condition on feature vectors (i.e., f(Tnext) ≤ f(Ttarget) must hold since Tnext must be a subgraph of Ttarget).
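As a sketch, the three pruning conditions can be combined into a single admissibility test. The helper below and its argument names are illustrative, not the thesis's code; feature vectors are represented as dicts from label strings to counts.

```python
# standard valences for the atoms BB-CIPF considers
VALENCE = {'C': 4, 'N': 3, 'O': 2, 'H': 1}

def admissible(new_atom, counts_after, atom_budget,
               w_label, w_used_bonds, bond_order,
               f_next, f_target):
    """Return True iff adding leaf (new_atom, bond_order) at vertex w
    survives the three BB-CIPF bounding conditions."""
    # (i) atom-occurrence condition: the atoms of T_next must stay
    #     within the multiset atomset(f_target)
    if counts_after.get(new_atom, 0) > atom_budget.get(new_atom, 0):
        return False
    # (ii) valence condition at the attachment vertex w
    if w_used_bonds + bond_order > VALENCE[w_label]:
        return False
    # (iii) feature-vector condition: f(T_next) <= f(T_target) componentwise
    return all(f_next.get(t, 0) <= f_target.get(t, 0) for t in f_next)
```

For instance, attaching a third oxygen when the target contains only two fails condition (i), and a double bond to a carbon that already uses three bonds fails condition (ii).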

2.2 Message Passing Interface


The Message Passing Interface has already been used to solve chemistry problems [24]. MPI (Message Passing Interface) is a specification for a standard message-passing library defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed [14].

The message-passing model of parallel computation has emerged as an expressive, efficient, and well-understood paradigm for parallel programming. Until recently, the syntax and precise semantics of each message-passing library implementation were different from the others, although many of the general semantics were similar. The proliferation of message-passing library designs from both vendors and users was appropriate for a while, but eventually it was seen that enough consensus on requirements and general semantics for message-passing had been reached that an attempt at standardization might usefully be undertaken.

MPI is a message-passing application programmer interface that includes protocol and semantic specifications for how its features must behave in any implementation (such as message buffering and message delivery progress requirements) [14]. MPI includes point-to-point message passing and collective or global operations. Moreover, MPI provides two levels of naming for processes: first, processes are named according to their rank in the group performing a communication; second, virtual topologies allow graph or Cartesian naming of processes, which helps relate the application semantics to the message-passing semantics in a convenient and efficient way. Communicators gather group and communication-context information and provide an important measure of safety, necessary and useful for building library-oriented parallel code.


2.3 SMILES

SMILES (Simplified Molecular Input Line Entry System) is a chemical notation system designed for modern chemical information processing [25]. Figure 2-3 is an example of SMILES form. SMILES is based on principles of molecular graph theory, allowing rigorous structure specification through a very small and intuitive grammar. The SMILES notation system is also well suited to high-speed machine processing. The resulting ease of use by chemist and machine allows many highly efficient chemical computer applications to be designed, including generation of a unique notation, constant-speed database retrieval, flexible substructure searching, and property prediction models. It is the result of achieving the following original objectives [25]:

(1) The graph of a chemical structure is to be uniquely described, including but not limited to the molecular graph comprising nodes (atoms) and edges (bonds).

(2) A user-friendly structure specification is to be provided, so that all input rules can be learned quickly and naturally.

(3) A machine-friendly and machine-independent system is to be designed for interpretation and generation of a unique notation.

Figure 2-3: an example of SMILES form :CC( C )C(=O)O


Rules for generating SMILES for virtually any chemical structure are illustrated by figure 2-3, whose structure is translated to the SMILES form CC(C)C(=O)O by the following rules:

 Atom: Atoms are represented by their atomic symbols. This is the only required use of letters in SMILES.

 Bonds: Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively. Single and aromatic bonds are usually omitted.

 Branches: Branches are specified by enclosures in parentheses.
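For tree-like molecules, these rules can be sketched with a depth-first traversal. Real SMILES generation (canonical atom ordering, rings, aromaticity) is considerably more involved, so the encoder below is only an illustrative sketch under the three rules above, with names of our own choosing.

```python
def to_smiles(adj, labels, bonds, v, parent=None):
    """Encode a tree-shaped molecule as a SMILES-like string by DFS.

    adj    : vertex -> ordered list of neighbour vertices
    labels : vertex -> atomic symbol
    bonds  : frozenset({u, v}) -> bond symbol ('=' double, '#' triple);
             single bonds are omitted, as the rules above state
    """
    out = [labels[v]]
    children = [w for w in adj[v] if w != parent]
    for i, w in enumerate(children):
        sub = bonds.get(frozenset((v, w)), '') + to_smiles(adj, labels, bonds, w, v)
        # every child except the last is a branch, written in parentheses
        out.append('(' + sub + ')' if i < len(children) - 1 else sub)
    return ''.join(out)

# the molecule of figure 2-3 (H atoms suppressed):
# C0-C1(-C2)-C3(=O4)-O5
labels = {0: 'C', 1: 'C', 2: 'C', 3: 'C', 4: 'O', 5: 'O'}
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4, 5], 4: [3], 5: [3]}
bonds = {frozenset((3, 4)): '='}
smiles = to_smiles(adj, labels, bonds, 0)  # 'CC(C)C(=O)O'
```

Because equal strings collapse in a Python set, results encoded to the same string can be filtered with `set()`, which is how the thesis uses SMILES to remove duplicates.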


CHAPTER3. RELATED TOOLKIT

3.1 Parallel computing model

In the simplest sense, parallel computing means using multiple computing resources at the same time to solve a computational problem. First, the problem is broken into discrete parts that can be solved concurrently; each part is further broken down into a series of tasks, and tasks from each part are executed on different CPUs at the same time. Figure 3-1 shows the basic concept of parallel computing.

Figure 3-1: Basic concept of parallel computing

Figure 3-2 shows two kinds of entities: the master and multiple slaves. The master is responsible for decomposing the problem into small tasks, assigning these tasks to slave processes, and gathering the computed results from the slave processes.

The master and slaves may use static or dynamic load balancing [21]. In the static case, the assignment of tasks is performed entirely at the beginning of the computation. The static load-balancing approach allows the master to participate in the computation after each slave has been allocated its fragment of the work; the allocation of tasks can be done once or cyclically. The alternative is a dynamically load-balanced approach, which is more suitable when the number of tasks exceeds the number of available processors, when execution times cannot be predicted, or when dealing with unbalanced problems. An important feature of dynamic load balancing is the ability of the application to adapt itself to changing conditions of the system, such as the failure of some processors, which simplifies the creation of applications capable of surviving the loss of slaves or even the master [21].

Figure 3-2: A basic concept of master and slaves approach


The master and slaves approach can achieve high computational speedups and an interesting degree of scalability [21]. However, with a large number of processors, the centralized control of the master process may become a bottleneck. It is possible to enhance the scalability of the paradigm by extending the single master to a set of masters, each of them controlling a group of slave processes.

3.2 MPI

3.3 MPI process model

Currently, there are several implementations of MPI, including versions for networks of workstations, clusters of personal computers, distributed-memory multiprocessors, and shared-memory machines [23]. Almost every hardware vendor supports MPI, so an MPI program can be executed on almost all existing computing platforms without rewriting.

In the MPI programming model (figure 3-3), a computation comprises one or more processes that communicate by calling library routines to send and receive messages to other processes. In most MPI implementations, a fixed set of processes is created at program initialization, and one process is created per processor.

Figure 3-3: MPI programming model

In MPI, a parallel application consists of a number of processes that run concurrently. Each process has its own local memory and communicates with other processes by sending and receiving messages. When data is passed in a message, both processes must work to transfer the data from the local memory of one to the local memory of the other.

3.4 MPI functions

Table 3-1 shows some commonly used MPI functions. An MPI program is initialized with a call to MPI_INIT; this must be the first MPI call. The default communicator, MPI_COMM_WORLD, defines the communication context and the set of all processes used; it must be used in every program and subprogram that makes MPI calls. MPI_COMM_SIZE queries the number of processes being used, and MPI_COMM_RANK identifies the rank of the process on which execution is taking place. Finally, the MPI environment is terminated by a call to MPI_FINALIZE, after which no further MPI calls may be made. Every process must make this call.

Table 3-1: common functions of MPI

MPI_INIT        MPI is initialized with a call to this subroutine; it must be the first MPI call
MPI_COMM_WORLD  defines the communication context and the set of all processes used
MPI_COMM_SIZE   queries the number of processes being used
MPI_COMM_RANK   identifies the rank of the process on which execution is taking place
MPI_FINALIZE    terminates the MPI environment; no MPI calls may be made after this statement


CHAPTER4. PARALLEL-BB-CIPF (PB-CIPF)

In order to reduce the computing time of inferring chemical compounds, we adopted the concept of parallel computing: tasks are separated and computed on multiple processors.

Solving NP-hard discrete optimization problems to optimality is often an immense job requiring very efficient algorithms, and the branch-and-bound paradigm is one of the main tools for constructing them [10]. Our method is based on BB-CIPF: we find all possible compounds with the same feature vector. To reduce the computing time, the proposed method uses MPI to solve the chemical compound inference problem.

A branch-and-bound algorithm searches the complete space of solutions of a given problem for the best one. However, explicit enumeration is normally impossible due to the exponentially increasing number of potential solutions. The use of bounds on the function to be optimized, combined with the value of the current best solution, enables the algorithm to search parts of the solution space only implicitly [10].

4.1 Overview of PB-CIPF

In the previous section, we described how a compound c is mapped through a function φ to a feature vector g in feature space; we want to find all pre-images c' with φ(c') = g. Figure 4-1 shows that we are interested in inferring compounds which have the same feature vector as the target compound. If the compound structure is bigger, the solution space is bigger, which leads to a larger computation time to find the answer. In other words, as compound size grows, more time is needed to map back from g to c', incurring a substantial increase in the amount of time spent.

Figure 4-1: Inferring all possible pre-images c' with φ(c') = g of a graph from a feature vector.

In this study, we developed a parallel algorithm to decrease the computing time and to find all possible compound structures having the same feature vector as the target compound. The task is separated into several parts, which are distributed appropriately to the computing nodes.

Figure 4-2 shows the idea of our approach. The proposed method has two stages. First, a master node builds several candidate compounds using BFS; these candidates are then dispatched to the participating computing nodes according to a block distribution. Second, each computing node infers the pre-images c' with φ(c') = g using a DFS approach. Both the BFS and the DFS adopt the branch-and-bound approach.

In the first (BFS) stage, the master node loads the target compound and computes its feature vector. It then employs breadth-first search (BFS) to build several candidate compounds and obtain their path frequencies for distributing jobs. Tasks are put into blocks of appropriate size, and each block is assigned one by one to the computing nodes.

The second stage is the DFS stage. After the master node assigns tasks, each computing node uses depth-first search (DFS) to insert an atom into a candidate compound. After inserting an atom, the candidate compound's feature vector is compared with the feature vector of the target compound. If the candidate's feature vector still matches part of the target's, the atom is retained; if it differs from what the target structure allows, the atom is dropped and DFS is applied to insert another atom into the candidate compound. If a candidate compound has the same feature vector as the target compound, it is a solution. This continues until all nodes have completed their candidate compounds.

Figure 4-2 : Procedure diagram of PB-CIPF


The branch-and-bound approach consists of three stages:

(1) Branching stage: insert a new node into the selected candidate compound.

(2) Bounding stage: discard the candidate if the addition of a leaf atom violates the condition on the numbers of atom occurrences.

(3) Terminating stage: stop when all candidate compounds have finished computing.

Each task takes a different amount of computing time: a compound with a bigger structure has a larger solution space and thus requires more computing time. For example, suppose we have four computing nodes C1, C2, C3, and C4. Master node C1 analyzes the target compound, calculates the feature vector, establishes four tasks T1, T2, T3, T4, and assigns these tasks to the four computing nodes for execution. Timing stops after the last task is finished.

4.2 BFS approach

BFS is an uninformed search method that expands and examines all nodes of a graph, or all combinations of sequences, by systematically searching through every solution [1]. In other words, it searches the entire graph exhaustively without considering the goal until it finds it, and it does not use a heuristic. In graph theory, breadth-first search (BFS) is a graph search algorithm that begins at the root node and explores all the neighboring nodes [1]. Then, for each of those nearest nodes, it explores their unexplored neighbor nodes, and so on, until it finds the goal.

In the BFS stage, the master node loads the target compound and computes the feature vector, then employs the breadth-first search approach to obtain the candidate compounds. Figure 4-3 shows the flow chart of BFS. Each job in BFS starts from an initial compound, into which we try to insert a new atom. To check whether the produced compound can be a candidate, we compute its feature vector. If the feature vector of the produced compound matches part of the target feature vector, the produced compound may be one of the solutions that have the same feature vector as the target.

Figure 4-3: flow chart of BFS

Figure 4-4 shows the BFS execution in more detail. First, we read the target compound, compute the feature vector, and obtain the atom set. For example, if the target compound is C2H4O2, the atom set is [C, O, H]. We choose a starting atom to grow; however, H atoms are not chosen. We then insert a new atom from the atom set next to the starting atom (e.g., given a C atom, we insert another C atom, obtaining the compound C-C). After inserting a new atom, we compare the feature vector to the target feature vector. If the feature vector is part of the target feature vector (C-C is part of the target feature vector), the compound might lead to a solution.

Figure 4-4: example of BFS processing

The objective of the BFS stage is to produce enough candidate compounds to run the DFS inference algorithm; by using BFS to create a broad solution base, the next stage can easily be computed in parallel. The master node builds several candidate compounds using the BFS approach and obtains their path frequencies for distributing jobs. The produced candidate compounds are gathered as tasks; tasks are put into blocks of appropriate size, and each block is assigned one by one to the computing nodes. Figure 4-5 shows that we first produce many candidate compounds and gather them into blocks, then distribute each block to a computing node to run the next stage.
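The BFS stage can be sketched generically. In the toy sketch below, compounds are abbreviated as label strings and the pruning test stands in for the feature-vector comparison; all names are illustrative, not the thesis's code.

```python
from collections import deque

def bfs_candidates(start, extend, is_partial_match, levels):
    """Expand `levels` BFS levels of candidate compounds, pruning as we go.

    start            : the initial compound
    extend           : compound -> list of compounds with one more atom
    is_partial_match : stand-in for "feature vector matches part of the target"
    levels           : how many levels to expand before handing off to DFS
    """
    frontier = deque([start])
    for _ in range(levels):
        nxt = deque()
        for c in frontier:
            for c2 in extend(c):
                if is_partial_match(c2):   # bound: drop impossible candidates
                    nxt.append(c2)
        frontier = nxt
    return list(frontier)   # these are then grouped into blocks for the nodes

# toy run: grow strings by C or O, allowing at most one O
cands = bfs_candidates('C',
                       lambda c: [c + 'C', c + 'O'],
                       lambda c: c.count('O') <= 1,
                       2)
```

After two levels the frontier is ['CCC', 'CCO', 'COC']; 'COO' was pruned at the bound.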


Figure 4-5: An example of BFS stage of PB-CIPF

4.3 DFS approach

Depth-first search (DFS) is an algorithm for traversing or searching a tree, tree structure, or graph. It starts at the root (selecting some node as the root in the graph case) and explores as far as possible along each branch before backtracking [2].

Formally, DFS is an uninformed search that progresses by expanding the first child node of the search tree that appears and thus goes deeper and deeper until a goal node is found, or until it hits a node that has no children [2]. Then the search backtracks, returning to the most recent node that it has not finished exploring. In a non-recursive implementation, all freshly expanded nodes are added to a stack for exploration.


When searching large graphs that cannot be fully contained in memory, DFS would suffer from non-termination when the length of a path in the search tree is infinite.

The depth-first search (DFS) approach inserts an atom into a candidate compound. After the atom is inserted, the feature vector of the candidate compound is compared with that of the target compound. From Figure 4-6, we can see that a new atom is inserted into the candidate compound, and we then check whether the feature vector of the new compound matches part of the target feature vector. If the feature vector does not match, we stop inserting atoms and drop this task. If the feature vector of the new compound matches, we insert another new atom. A candidate compound whose feature vector matches part of the target feature vector becomes the new candidate compound, and the inserting procedure continues. If inserting the atom keeps the candidate's feature vector consistent with part of the target compound's structure, the atom is kept. When the feature vector of a candidate compound is the same as the target feature vector, that candidate compound is one of the solutions.


Figure 4-6: flow chart of DFS

Figure 4-7 shows in more detail how DFS runs. The previous BFS stage produces many candidate compounds. The DFS approach searches until a solution is found or no solution exists. For example, after getting a candidate compound (such as C-C), we choose an atom from the atom set (such as [C, O, H]) and insert it into the candidate compound (Figure 4-7(a)). If the candidate feature vector differs from the target compound structure, the atom is dropped, and the DFS approach continues by inserting another atom into the candidate compound. When adding a leaf atom to the candidate compound would violate the condition on the numbers of occurrences of atoms, that candidate compound is dropped. After running the previous steps, finding a compound that matches the target yields a result (see Figure 4-7(c)).

Figure 4-7: inserting atoms

From Figure 4-8 we can see that if a candidate compound has the same feature vector as the target structure, it is a solution. The same procedure is repeated until all nodes have completed their candidate compounds.

Figure 4-8: feature vector of result compound will be same as target

After getting a candidate compound from the BFS approach, we choose an atom from the atom set and insert it into the candidate compound. After inserting the new atom, we check whether the feature vector matches the target feature vector. If inserting the node makes the compound no longer match, we drop the atom and insert another one. The inserting approach proceeds by trial and error. From Figure 4-9 we can see that each computing node runs the DFS approach, continuing to insert atoms until the candidate satisfies the condition of having the same feature vector as the target's.

Figure 4-9: An example of DFS stage of PB-CIPF


4.4 The procedure of PB-CIPF

Procedure PB-CIPF(Ttemp, ftemp, Ttarget, ftarget)

Get the number of processors: n

Let the first computing node be the master computing node.

Master computing node:

Step 1: Produce candidate compounds by running BFS(Ttarget). Store all produced candidate compounds in Tqueue.

The procedure of BFS is shown below:

Procedure BFS(Ttarget)
    for (a = each atom existing in Ttarget) do
        Let Ttemp be a temporary candidate compound
        Initialize Ttemp
        Insert a as a new node into Ttemp
        Add Ttemp to Tqueue
        Process the next atom in Ttarget
    end

Step 2: Gather candidate compounds as tasks. Group the tasks into blocks according to the number of computing nodes.

Step 3: Assign each block of tasks to a computing node.

Slave computing node:

Step 1: Receive tasks.

Step 2: Run the DFS approach.

The procedure of DFS is shown below:

Procedure DFS(Ttemp, ftemp, Ttarget, ftarget)
    if ftemp = ftarget
        then output a solution Ttemp; return true;
    for (a = each atom existing in Ttarget) do
        if a = H then continue;
            (hydrogen atoms will be added at the last stage)
        if {l(u) | u ∈ V(Ttemp)} ∪ {a} ⊄ atomset(ftarget)
            (set means multiset here)
            then continue;
        for all w ∈ V(Ttemp) do
            Let Tnext be the tree formed by connecting a new leaf node u with label a to w
            if w does not satisfy the valence constraint
                then continue;
            Compute fnext from Tnext and ftemp;
            if DFS(Tnext, fnext, Ttarget, ftarget) = true
                then return true;
        end
    end
    return false;

Step 3: Send the result to a queue storing all matched results.

Step 4: If tasks remain, go to Step 2; otherwise, end.

The pseudo code of PB-CIPF is shown above; the implemented code has more details. The following are the detailed parts:

(i) Hydrogen atoms will be added at the last stage of the inferring procedure.

Hydrogen atoms will be added only if the frequencies of the other atoms are the same as those in the target feature vector.

(ii) When calculating fnext from Tnext and ftemp, only paths beginning from and ending at the new node are computed.

(iii) Benzene rings can be added as if they were leaves, where structural information on benzene is utilized for calculating feature vectors.

(iv) A benzene ring will be given an initial structure when a compound is small and contains a benzene ring.
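As an executable illustration of the branch-and-bound recursion above, the following sketch (hypothetical names; hydrogen handling, valence constraints, and the benzene special cases (i)-(iv) are deliberately omitted) grows trees atom by atom, prunes any branch whose feature vector is not part of the target's, and collects every tree whose feature vector equals the target's:

```python
from collections import Counter

def path_frequency(atoms, bonds, K):
    """Count labeled simple paths of length 0..K (same sketch as before)."""
    adj = {i: [] for i in range(len(atoms))}
    for u, v in bonds:
        adj[u].append(v)
        adj[v].append(u)
    freq = Counter()
    def extend(path, visited):
        freq["-".join(atoms[i] for i in path)] += 1
        if len(path) - 1 < K:
            for nxt in adj[path[-1]]:
                if nxt not in visited:
                    extend(path + [nxt], visited | {nxt})
    for s in range(len(atoms)):
        extend([s], {s})
    return freq

def infer(atom_pool, f_target, K):
    """Enumerate labeled trees over atom_pool whose path frequency equals f_target."""
    found = []
    def dfs(atoms, bonds, remaining):
        f = path_frequency(atoms, bonds, K)
        if any(f[p] > f_target.get(p, 0) for p in f):
            return                              # bound: not part of the target
        if not remaining:
            if f == f_target:
                found.append((tuple(atoms), tuple(bonds)))
            return
        for a in sorted(set(remaining)):        # branch: try each remaining label
            rest = list(remaining)
            rest.remove(a)
            for w in range(len(atoms)):         # attach a as a new leaf of node w
                dfs(atoms + [a], bonds + [(w, len(atoms))], rest)
    for a in sorted(set(atom_pool)):
        rest = list(atom_pool)
        rest.remove(a)
        dfs([a], [], rest)
    return found

f_target = path_frequency(["C", "C", "O"], [(0, 1), (1, 2)], 1)
solutions = infer(["C", "C", "O"], f_target, 1)
```

For the target C-C-O with K = 1 this finds the matching trees (including isomorphic copies, which is exactly why the SMILES-based filter of Section 4.6.1 is needed) while pruning the non-matching arrangement C-O-C.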

4.5 Encoding SMILES form

After producing a new result, we transfer it to another data structure in order to reduce duplicate result compounds; therefore we use the SMILES form. We read each atom position from the compound. From Figure 4-10 we can see that the positions of the atoms are 1:C, 2:C, 3:O, 4:O. In SMILES, H atoms are ignored, so the positions of H atoms are skipped. We start with the atom that has the fewest connected neighbors.

From Figure 4-10, the atoms with the fewest connected neighbors are 1:C, 3:O, and 4:O. For ease of computation, we choose 1:C as the first atom. Then we find the neighbor atoms of 1:C. If a branch is met, we use "()" to represent it and swap the branch with the atom in front of it. The next atom is 2:C, so the SMILES form is now C-C. We continue with the neighbors of 2:C; they are 4:O and 3:O. 3:O is inserted into the array first, followed by 4:O, so the array becomes C-C-O(=O). To follow the SMILES rules, the branch must be at the front position, so we swap 3:O and 4:O. After the transfer, the final SMILES form is C-C(=O)-O.

Figure 4-10: a sample compound

The procedure of encoding SMILES:

Step 1: Get the positions of all atoms except H atoms.

Step 2: Choose the atom with the smallest number of neighbors as the first atom.

Step 3: Connect the neighbors of the atom. Single, double, and triple bonds are represented by the symbols -, =, #. If an atom is on a branch, use "()" to represent it and swap it with the atom in front.

Step 4: After Step 3, find the next atom.

Step 5: Repeat Step 4 until all atoms are connected.
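A simplified encoder following these steps can be sketched as below (a hedged illustration: it handles tree-shaped compounds only, with a hypothetical `(u, v, symbol)` bond layout, and omits ring closures and the full canonicalization of real SMILES [25]):

```python
def encode_smiles(atoms, bonds, root=None):
    """Encode a tree-shaped compound as a SMILES-like string.

    atoms: element symbol per node id (H atoms already removed)
    bonds: (u, v, sym) triples with sym in {"-", "=", "#"}
    """
    adj = {i: [] for i in range(len(atoms))}
    for u, v, sym in bonds:
        adj[u].append((v, sym))
        adj[v].append((u, sym))
    if root is None:
        # Step 2: start from an atom with the fewest neighbors
        root = min(range(len(atoms)), key=lambda i: len(adj[i]))

    def emit(node, parent):
        parts = []
        children = [(v, s) for v, s in adj[node] if v != parent]
        for k, (v, s) in enumerate(children):
            sub = s + emit(v, node)
            # Step 3: every child except the last becomes a "(...)" branch,
            # so branches sit in front of the continuing main chain
            parts.append(sub if k == len(children) - 1 else "(" + sub + ")")
        return atoms[node] + "".join(parts)

    return emit(root, None)

# Figure 4-10's compound: 1:C, 2:C, 3:O, 4:O with C1-C2, C2=O3, C2-O4
smiles = encode_smiles(["C", "C", "O", "O"], [(0, 1, "-"), (1, 2, "="), (1, 3, "-")])
```

Run on the Figure 4-10 compound, this reproduces the string C-C(=O)-O derived in the worked example above.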

4.6 Modified PB-CIPF

4.6.1 Isomorphic problem


When running our inferring algorithm, it may generate some isomorphic compounds. To avoid the isomorphism problem, we modified our tree search approach.

When we receive a new candidate compound, we transfer its notation to SMILES. If the notation has appeared previously, we drop this candidate. If the candidate has a new notation, we keep it, save the notation, and continue our search. Figure 4-11 shows the process.

Figure 4-11: a sample of the approach

4.6.2 Job scheduling

At the master node, several tasks are computed. Tasks are stored in a global job queue. Each slave node has a local job queue. When the number of jobs in the local job queue is less than 15, the slave node receives tasks sent from the master node.

If the number of jobs in the local queue is equal to or greater than 15, the slave node does not receive any tasks until the number of jobs drops below 15. We keep each computing node executing at most 15 tasks at once in order to keep the load balanced.

Figure 5-12: a sample of the job scheduling
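The threshold-15 policy can be illustrated with a small sequential simulation (a hedged sketch with hypothetical names; the thesis's implementation exchanges tasks through MPI messages rather than shared Python lists):

```python
CAP = 15  # a slave's local queue never holds more than this many jobs

def simulate(tasks, n_slaves, cap=CAP):
    """Distribute tasks to slave queues, refusing slaves whose queue is full."""
    global_queue = list(tasks)
    local = [[] for _ in range(n_slaves)]
    done = 0
    max_len = 0
    while global_queue or any(local):
        # master: top up every slave that has fewer than cap jobs queued
        for q in local:
            while global_queue and len(q) < cap:
                q.append(global_queue.pop(0))
            max_len = max(max_len, len(q))
        # slaves: each executes one job from its local queue per round
        for q in local:
            if q:
                q.pop(0)
                done += 1
    return done, max_len

done, max_len = simulate(range(50), 3)
```

Every task is eventually executed, and no local queue ever grows beyond the cap, which is the load-balancing property the policy is designed to give.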


CHAPTER 5. EXPERIMENT

Experiments were done to verify that our algorithm is efficient and reduces the completion time. The experiments were run with different numbers of computing nodes.

5.1 Experimental environments

We used a PC cluster with AMD Athlon(TM) XP 2000+ CPUs and 1 GB RAM running Linux to verify the performance of our algorithm. PB-CIPF was implemented in the C language with MPI, using MPICH2. The test data were obtained from the KEGG LIGAND Compound Database.

5.2 Experimental result

5.2.1 PB-CIPF

In this section, experimental results of our algorithm are shown in terms of comparisons among a single node, two nodes, and four nodes. We increased the number of computing nodes to verify that adding computing nodes can reduce the computing time.

The makespan is defined as follows: if in our experiment there are four computing nodes C0, C1, C2, and C3, with finish times t0, t1, t2, and t3, and t2 is the longest finish time, then the makespan is t2. That is, the makespan is the longest finish time among the four nodes.

We chose several compounds from the KEGG LIGAND Database. We tested and verified our algorithm with K = 1, 2, 3, and 4, where K is the length of the label sequences of the feature vectors. A bigger value of K means more constraints on the target compound.


Figures 5-1 to 5-5 show the makespans for target compounds of different sizes. Table 5-1 shows the compound sizes with and without H atoms. The size of C11108 is 13 with H atoms and 9 without. The size of C11109 is 14 with H atoms and 7 without. The size of C00097 is 15 with H atoms and 9 without. The size of C15987 is 19 with H atoms and 8 without.

Table 5-1: size of compounds with and without H atoms

Compound    with H atoms    without H atoms
C11108      13              9
C11109      14              7
C15987      19              8
C00097      15              9

A smaller value of K creates more permutations with shorter path frequency lengths. A higher number of permutations matching the path frequency yields more results, and when the number of permutations is too big, more computing time is needed. Accordingly, Figure 5-5 shows that when computing C15987 with K=1, the computing time is much greater than for other values of K, so the makespan of C15987 in Figure 5-4 excludes the result for K=1. We can see that the makespan was reduced when the number of nodes increased. For example, in Figure 5-1, when the value of K is 4, the makespan is reduced from 399.546 seconds to 279.82 seconds with four nodes.


Figure 5-1: Makespan of C11108

Figure 5-2: Makespan of C00097



Figure 5-3: Makespan of C11109

Figure 5-4: Makespan of C15987



Figure 5-5: Makespan of C15987 with K=1



When K = 1 and the number of atoms is bigger than 19, the solution space is so big that much more computing time is needed to finish. We used several compounds with different atom sizes to verify the performance of our algorithm. To evaluate performance, we computed the speedup ratio of each test case. The speedup ratio is defined as follows: if the computing time of a single node is t0 and the computing time of two nodes is t1, then the speedup ratio of 2 computing nodes is t0 / t1.
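For reference, the makespan and speedup-ratio definitions amount to the following (the numbers are the C11108, K=4 figures quoted earlier; the function names are illustrative):

```python
def makespan(finish_times):
    """The makespan is the longest finish time among the computing nodes."""
    return max(finish_times)

def speedup(t_single, t_parallel):
    """Speedup ratio = single-node computing time / parallel computing time."""
    return t_single / t_parallel

# C11108 with K = 4: 399.546 s on 1 node vs 279.82 s on 4 nodes
s = speedup(399.546, 279.82)
```

This gives a speedup of roughly 1.43 for that case.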

Figures 5-6 and 5-7 show the speedup of our algorithm. Increasing to 4 computing nodes, we find that the average speedup ratio is about 1.9. According to our experiments, our algorithm can reduce the computing time.

Figure 5-6: Speedup ratio of 2 nodes



Figure 5-7: Speedup ratio of 4 nodes

5.2.2 PB-CIPF-M

In this section we show the experimental results of the modified PB-CIPF. Table 5-2 shows the computing time of C00097. We can see that the difference between the computing nodes is small.

Figures 5-8 to 5-10 show the makespans of C11108, C11109, and C15987 with the modified PB-CIPF.



Figure 5-8: Makespan of C11108

Figure 5-9: Makespan of C11109



Figure 5-10: Makespan of C15987

Figure 5-11: Makespan of C00097



CHAPTER 6. CONCLUSION

Parallel computing can be an effective method for solving computational problems. Parallel computing uses multiple computing resources at the same time to reduce computing time. In this research, we proposed a parallel algorithm, PB-CIPF, for the problem of chemical compound inference from path frequency. We referred to the BB-CIPF algorithm for compound inference. BB-CIPF uses tree-like structures in inferring chemical compounds; chemical compounds are assigned a feature vector by the algorithm based on the frequency of small fragments. Our approach has two stages.

First, a master node builds several candidate compounds using the BFS approach.

Then we distribute the candidate compounds to the participating computing nodes according to a block distribution, and each computing node infers c' = φ(g) using a DFS approach. Both the BFS and the DFS adopt the branch-and-bound approach. The experimental results show that our algorithm can reduce the computing time. When computing with 4 nodes, the average speedup is 1.820136 when K=1, 1.891199 when K=2, 1.954588 when K=3, and 1.995495 when K=4. According to our experimental results, our parallel algorithm reduces the computing time. We also improved our algorithm in task scheduling and in filtering duplicate results.

Although parallel computing is respectably efficient in solving computational problems, a better scheduling method could save more time. In the future, we will study how to make the job scheduling more balanced and effective, and how to decrease the communication involved in checking isomorphic compounds so as to shorten the evolution time. If we could do that, the computing time could be reduced greatly, giving a better speedup ratio.


References

[1] Breadth-first search: http://en.wikipedia.org/wiki/Breadth-first_search

[2] Depth-first search: http://en.wikipedia.org/wiki/Depth-first_search

[3] T. Akutsu and D. Fukagawa, "Inferring a graph from path frequency," Combinatorial Pattern Matching, vol. 16, 2005, 371-392.

[4] T. Akutsu and D. Fukagawa, "On inference of a chemical structure from path frequency," BIOINFO, 2005, 96-100.

[5] T. Akutsu and D. Fukagawa, "Inferring a Chemical Structure from a Feature Vector Based on Frequency of Labeled Paths and Small Fragments," 5th Asia-Pacific Bioinformatics Conference, 2007, 165-174.

[6] G. H. Bakir, A. Zien, and K. Tsuda, "Learning to find graph pre-images," The 26th DAGM Symposium, 2004, 253-261.

[7] A. Ben-Hur and W. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics, vol. 21, 2005, 38-46.

[8] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, 1998, 121-167.

[9] E. Byvatov, U. Fechner, J. Sadowski, and G. Schneider, "Comparison of support vector machine and artificial neural network systems for drug/nondrug classification," Journal of Chemical Information and Computer Sciences, vol. 43, 2003, 1882-1889.

[10] J. Clausen, "Branch and Bound Algorithms: Principles and Examples," Parallel Computing in Optimization, 1997, 337-360.

[11] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, 1995, 273-297.

[12] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge Univ. Press, 2000.

[13] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis, "Frequent substructure-based approaches for classifying chemical compounds," IEEE Trans. Knowledge and Data Engineering, vol. 17, 2005, 1036-1050.

[14] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, vol. 22, 1996, 789-828.

[15] C. J. Harris and A. P. Stevens, "Chemogenomics: structuring the drug discovery process to gene families," Drug Discovery Today, vol. 11, 2006, 880-888.

[16] H. Kashima, K. Tsuda, and A. Inokuchi, "Marginalized kernels between labeled graphs," Machine Learning, vol. 20, 2003, 321-328.

[17] J. Lauri and R. Scapellato, Topics in Graph Automorphisms and Reconstruction, Cambridge Univ. Press, 2003.

[18] A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Springer, 2003.

[19] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert, "Graph kernels for molecular structure-activity relationship analysis with support vector machines," Journal of Chemical Information and Modeling, vol. 45, 2005, 939-951.

[20] O. Maruyama and S. Miyano, "Inferring a tree from walks," Theoretical Computer Science, vol. 161, 1996.

[21] L. M. e Silva and R. Buyya, "Parallel Programming Paradigms," High Performance Cluster Computing: The Programming and Application Issue, vol. 2, 1999.

[22] D. B. Skillicorn and D. Talia, "Models and Languages for Parallel Computation," ACM Computing Surveys, vol. 30, no. 2, 1998, 123-169.

[23] T. Tezduyar and Y. Osawa, "Methods for parallel computation of complex flow problems," Parallel Computing, vol. 25, 1999, 2039-2066.

[24] J. J. Vincent and K. M. Merz Jr., "A highly portable parallel implementation of AMBER4 using the message passing interface standard," Journal of Computational Chemistry, vol. 16, 1995, 1420-1427.

[25] D. Weininger, "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules," Journal of Chemical Information and Modeling, 1987, 863-870.
