
Chung Hua University
Master's Thesis

Toward the Design of a Bioinformatic Machine

Department: Master's Program, Department of Computer Science and Information Engineering
Student: M09102048 賴韋丞
Advisor: Dr. 許文龍

June 1, 2004


Abstract in Chinese

Bioinformatics problems involve algorithms with enormous time complexity together with very large biological databases, which is a major challenge to current computer technology. The purpose of the bioinformatic machine in this thesis is to address these problems.

The bioinformatic machine is a PC-cluster architecture that uses special-purpose hardware to accelerate dynamic programming, genetic algorithms, and data mining algorithms. In this machine we design an Object Relational DataBase Machine (ORDBM) with two hardware modules, the Relational Operation Processing Unit (ROPU) and the Cluster-based Parallel Search System (CPSS). The sequence alignment hardware modules include the Parallel Hardware of Dynamic Programming Alignment (PHDPA) and the Hardware of Genetic Algorithm (HGA). This thesis focuses on the design of the ROPU and the PHDPA.

The ROPU is a massive processor array that executes the four algorithms Ugsel, Mgsel, Gadd, and Match for data mining. After ROPU processing, the object identifiers of the object database on every cluster node are sorted and the database records are perfectly distributed. The ROPU, together with the cluster processors (MPU), can execute all of the relational algebra and aggregate function operations of a relational database.

The PHDPA accelerates dynamic programming sequence alignment in hardware in order to cope with its huge time complexity. We design a dedicated processor to compute dynamic programming alignment scores and, based on the tabular computation of dynamic programming, develop a control strategy for general use. The PHDPA modules are then applied on the PC cluster.


Abstract

Bioinformatic problems usually involve huge time complexity and very large databases, which poses a great challenge to current computer technology. The concept of a bioinformatic machine is proposed to meet this challenge.

This bioinformatic machine is a PC cluster structure that uses special hardware to accelerate dynamic programming, the genetic algorithm, and data mining algorithms. In this machine, an Object Relational DataBase Machine is designed with two hardware modules: the Relational Operation Processing Unit (ROPU) and the Cluster-based Parallel Search System (CPSS). The sequence alignment hardware modules include the Parallel Hardware of Dynamic Programming Alignment (PHDPA) and the Hardware of Genetic Algorithm (HGA). This dissertation focuses on the design of the ROPU and the PHDPA.

The ROPU is a massive processor array that can execute the four algorithms Ugsel, Mgsel, Gadd, and Match for data mining. The object database is distributed across the cluster nodes. The ROPU and the cluster nodes can execute all of the relational algebra and aggregate function operators for relational databases.

The PHDPA is hardware that accelerates dynamic programming alignment in order to overcome its huge time complexity. We design a dedicated processor to calculate sequence alignment scores using dynamic programming. Based on the tabular computation of dynamic programming, we develop a control strategy for general use. The PHDPA modules can then be applied on the PC cluster.


Acknowledgment

First of all, I would like to thank my advisor, Dr. 許文龍, for his patient teaching and diligent guidance throughout this work; I hereby express my most sincere respect and gratitude for his instruction and encouragement.

I would also like to thank my classmates 紹貴, 銘杰, 偉峰, 盈欽, 家輝, 立偉, 家正, 旻芳 and 美蘭 for their mutual encouragement, as well as the teachers and friends who cared about me. Thank you for your constant concern and encouragement, which revived my fighting spirit and drive whenever I felt low or slackened.

Finally, I would like to thank my dear family for their support and help. Although I sometimes felt tired and discouraged, their warm affection always relieved my fatigue and kept me working hard.

I dedicate this thesis to my most respected teachers, my dearest family, and everyone who cared about and supported me, and I wish to share the results and the joy of this research with all of you.

賴韋丞, June 2004


Contents

Abstract in Chinese ..... i
Abstract ..... ii
Acknowledgment ..... iii
Contents ..... iv
List of Figures ..... vi
List of Tables ..... vii
Chapter 1 Introduction ..... 1
1.1 Object Relational DataBase Machine ..... 1
1.2 Cluster-based Search Algorithm ..... 3
1.3 Sequence Alignment ..... 4
1.4 Genetic Algorithm of TSP and MSA ..... 8
1.5 Dissertation Purpose and Organization ..... 9
Chapter 2 The Concept of a Bioinformatic Machine ..... 10
2.1 Propose a Bioinformatic Machine ..... 10
2.2 Database Mining ..... 12
2.2.1 Object Relational Database Machine ..... 12
2.2.2 m-1 Way Virtual Search Tree Algorithm ..... 14
2.3 Sequence Alignment ..... 16
2.3.1 Hardware of Dynamic Programming Alignment ..... 16
2.3.2 Hardware of Genetic Algorithm for TSP and MSA ..... 17
Chapter 3 Implement ROPU Hardware Module for an ORDBM ..... 21
3.1 Definition of Four Algorithms ..... 21
3.1.1 Ugsel and Mgsel algorithm ..... 21
3.1.3 Match algorithm ..... 24
3.2 Design ROPU and Its Interface I/O Cache ..... 24
3.2.1 I/O Cache ..... 24
3.2.2 Microcode Generator ..... 25
3.2.3 ROPU ..... 27
3.2.4 Edge Tuple Detection ..... 28
3.2.5 Backend Controller ..... 29
3.3 Simulate Result and Performance Analysis ..... 29
3.3.1 Simulate Result ..... 29
3.3.2 Performance Analysis ..... 32
Chapter 4 Parallel Hardware of Dynamic Programming Alignment ..... 33
4.1 Overview our PHDPA Architecture ..... 33
4.1.1 Our PHDPA Architecture ..... 33
4.1.2 Dedicated PHDPAs on PC cluster ..... 34
4.2 Data Representation of the PHDPA ..... 35
4.3 Design the PHDPA Components ..... 38
4.3.1 Multiple Processor Units ..... 39
4.3.2 Sequence Flow ..... 40
4.3.3 Scoring Matrix ..... 41
4.3.4 Storage ..... 43
4.3.5 Processor Unit ..... 45
4.3.6 Control Unit ..... 45
4.4 Performance Analysis ..... 47
Chapter 5 Conclusion and Future Works ..... 49
5.1 Conclusion ..... 49
5.2 Future Works ..... 49
Reference ..... 50
Appendix A ..... 53

List of Figures

Figure 2.1 Overview our bioinformatic machine based on PC cluster ..... 11
Figure 2.2 The prototype of ORDBM ..... 14
Figure 2.3 Ordered data distributed in cluster server farm ..... 15
Figure 2.4 m-1 way virtual search tree (when m=5) ..... 15
Figure 2.5 m-1 way virtual search tree building process ..... 16
Figure 2.6 Traversal of a tree in circular order ..... 18
Figure 2.7 Hardware/Software Partitioning ..... 19
Figure 2.8 The SHGA block diagram ..... 20
Figure 2.9 Pipelined and parallel configurations model ..... 20
Figure 3.1 I/O Cache ..... 25
Figure 3.2 Microcode block diagram ..... 26
Figure 3.3 The architecture of ROPU processor element ..... 27
Figure 3.4 Edge Tuple Detector compare 2 data segment edge tuple algorithm ..... 29
Figure 3.5 Backend Controller block diagram ..... 29
Figure 4.1 The PHDPA (right dotted rectangle) on the general purpose computer (left dotted rectangle) ..... 34
Figure 4.2 Server host commands to four nodes PC cluster with four dedicated PHDPAs ..... 35
Figure 4.3 Overall processions T=1~8 ..... 38
Figure 4.4 The processions T=1~8 ..... 39
Figure 4.5 Address bus according to input amino acid and data bus according to address ..... 40
Figure 4.6 Path (a) sequences from the database, path (b) sequences from the host query sequence ..... 41
Figure 4.7 Lower triangle in row major order ..... 42
Figure 4.8 CREW PRAM is designed by four multiplexors ..... 43
Figure 4.9 Inputs s0, s1, s2, s3 are from Multiple Processor Units, input start save is from Control Unit ..... 44
Figure 4.10 Dotted rectangle is Storage component ..... 44
Figure 4.11 Our dedicated processor ..... 45
Figure 4.12 Control Unit state diagram ..... 46
Figure 4.13 Illustrate Table 4.4 ..... 48

List of Tables

Table 2.1 Ugsel, Mgsel, Gadd, Match execute all of the relational algebra and aggregate function ..... 13
Table 3.1 The instruction set table of microcode operations ..... 26
Table 4.1 One- and three-letter amino acid codes and five bits representation ..... 30
Table 4.2 Six bits 2's complement and decimal digit ..... 37
Table 4.3 PAM250, lower triangle score in row major order ..... 42
Table 4.4 Shows PHDPA speed up with DPA assembly ..... 47


Chapter 1 Introduction

The growth rate of genetic databases is ever increasing. In 1990 GenBank was doubling in size every 22 months; in 2000 GenBank's doubling rate was 6 months. Moore's law tells us that CPU-based computing power doubles every 18 months. In one 18-month period, Moore's law will have reduced the computing cost by half, but the required computing load will have grown 8-fold or more in the same time [1].

Owing to this, we design special-purpose hardware to compensate for the fact that CPU-based computing cannot keep up with the growth of genetic databases.

Designing a bioinformatic machine means designing special-purpose hardware. Because bioinformatic algorithms are computationally demanding, we use hardware to accelerate them. Furthermore, we parallelize the special hardware and the search software on a PC cluster.

The bioinformatic machine includes an object relational database machine, a cluster-based search algorithm, sequence alignment, and a genetic algorithm. Section 1.1 introduces our object relational database machine. Section 1.2 introduces the cluster-based search algorithm. Section 1.3 introduces sequence alignment. Section 1.4 introduces the genetic algorithm for TSP and MSA. Section 1.5 states the purpose and organization of this dissertation.

1.1 Object Relational DataBase Machine

Research on database machines dates back to the sixties, but this kind of technology has still not found extensive application. Today, as data are produced and grow rapidly, the traditional computer architecture is no longer able to handle such large amounts of data. In earlier research, RAP and CASSM added data processing capability to the storage system but could not handle complicated computation. DBC, DIRECT, SHUFFLE NETWORK and CUBIC NETWORK use many processors on main memory to process the data, but they incur high communication and software costs and also suffer from instability. SYSTOLIC ARRAY and QUERY PROCESSOR handle different computations with different arrays of VLSI processors, so the system is complicated and its capability is limited. NON-VON and GRACE address this problem with relational algebra machines designed with VLSI techniques: NON-VON uses a tree structure of many processors, so the processor construction is complicated and the communication between processors is difficult, and its capability is therefore limited; GRACE uses hashing techniques and many hardware schedulers, but its data distribution is very uneven.

Our ORDBM solves several technical problems of parallel processing:

(1) Perfectly even data distribution: our ORDBM handles data distribution with the pure-hardware ROPU [2][3] and therefore avoids the problems of earlier parallel computers.

(2) Suitability for high parallelism: the parallel algorithms are implemented in pure hardware. When the degree of parallelism and the system complexity increase, the software complexity does not increase; only the hardware complexity grows, so the system remains stable and fast.

(3) Simple parallel language: our ORDBM is programmed through a high-level language; supporting SQL removes the parallel programming burden.

(4) No interconnection network required: the ROPU is a pure hardware data-processing array with a static configuration; data reorganization can be achieved flexibly simply by changing the SQL statements [4][5][6].

(5) Hardware/software cooperation: pure hardware performs the data reorganization, and software then uses this reorganization flexibly. The machine not only executes database operations but can also execute many scientific computations, such as matrix multiplication.

(6) Our ORDBM is based on a PC cluster and therefore supports PC cluster and Grid computing.

(7) Our ORDBM is low cost and can be provided as a next-generation desktop computer [7][8].

1.2 Cluster-based Search Algorithm

As computer technology has greatly improved in hardware performance and physical size, the major components of a computer system are getting cheaper, and it has become increasingly affordable to equip computers with large main memories. Many cheap yet powerful computers connected via a high-speed LAN in a distributed cluster environment appear to be the future architectural trend.

Dealing with data-intensive applications is challenging when all of the massive data are centralized on one host. In the past, B-tree [9][10] index structures and their many variants [11][12][13] have been the most important data structures in database systems. Recently, more and more researchers have paid attention to cache-conscious B+-tree [14][15] data structures. As we know, a tree operation consists of some basic operations such as data comparison, pointer assignment, arithmetic operations, acquisition and release of semaphores, and locking and unlocking; these operations introduce considerable maintenance overhead.

In this dissertation, we try to employ the large memory available in cluster environments to deal with data-intensive applications. We propose a hardware-based cluster architecture in which the partial data are distributed to the individual nodes and an m-1 way virtual tree search algorithm [16] is implemented. By fully utilizing in-memory operation, our algorithm can ingeniously avoid the B+-tree index maintenance overhead while keeping the efficiency of a search tree algorithm.

1.3 Sequence Alignment

Sequence alignment is useful for discovering function and structure and for understanding evolution. Sequences that are very similar probably have a similar function and/or structure. Two sequences from different organisms that are similar may have a common ancestor, as shown in Figure 1.1; the sequences are then said to be homologous.

Figure 1.1 Evolution: an ancestor sequence diverges into Sequence 1 (after x steps) and Sequence 2 (after y steps).

The alignment reflects the evolutionary process: the different sequences start from the same ancestor sequence and then change through mutations, including insertions, deletions and substitutions. Assume the following evolution.


seq1:          ACTGC
substitution:  ACCGC
deletion:      A-CGC
insertion:     ACGTC  (= seq2)

The relationship between seq1 and seq2 is therefore the alignment

seq1: ACTG-C
seq2: A-CGTC

Gap, match and mismatch are the alignment operators. A gap represents an insertion or a deletion, that is, a character aligned with a space. A match represents a conserved residue, that is, identical characters aligned. A mismatch represents a substitution, that is, different characters aligned.

The target of sequence alignment can be nucleotides or amino acids; both use the same alignment operators. The following is an amino acid sequence alignment of seq1: AGCDEKRVIG and seq2: AGEYCDKRIIG.

seq1: A G – – C D E K R V I G
seq2: A G E Y C D – K R I I G

A gap aligned with characters represents an insertion in one sequence or, equivalently, a deletion in the other; the V/I column is a substitution.


For example, align seq1: HEAGAWGHEE and seq2: PAWHEAE.

I.   HEAGAWGHEE
     PAWHEAE–––

II.  HEAGAWGHEE
     –––PAWHEAE

III. HEAGAWGHE–E
     –––PAW–HEAE

Which alignment is better? Which one is optimal? We need a scoring model consisting of a scoring matrix and a gap penalty.

A scoring matrix gives the score for aligning two amino acids (match or mismatch) in a pairwise alignment. A scoring matrix can be considered a measure of the evolutionary change. The most widely used matrices are PAMs and BLOSUMs [17]. Both calculate substitution frequencies between amino acids, and both are derived from known protein alignments.

PAM is the unit of Point Accepted Mutations (Dayhoff et al. 1978). PAM1 corresponds to an amount of evolutionary change of, on average, 1 accepted mutation per 100 amino acids. PAM40 is good for sequences with a small evolutionary distance, especially for short, strong local similarities. PAM250 is good for long sequences with a large evolutionary distance. PAM20 is suitable for comparing human and mouse.

The BLOSUM (BLOcks SUbstitution Matrix) scoring matrices (Henikoff & Henikoff, 1992) are similar to PAM. BLOSUM50 and BLOSUM62 are the ones most widely used; BLOSUM50 is the default for the FASTA sequence analysis program.



BLOSUM62 is the most effective for detecting known members of a protein family in a database when searching with the BLAST local alignment program.

Gap penalties are often composed of two parts: a gap opening penalty −d and a gap extension penalty −e. For a gap of length g:

˙Linear gap penalty: γ(g) = −gd

˙Affine gap penalty: γ(g) = −d − e(g − 1)

Dynamic Programming (DP) is an efficient recursive method that searches all possible alignments and finds the one with the optimal score. Dynamic programming usually consists of three components:

˙Recursive relation

˙Tabular computation

˙Traceback

Needleman-Wunsch (global alignment)

Align two sequences x1 x2 … xn and y1 y2 … ym and create an m x n matrix F where

F(i, j) = score of optimal path of subsequence x1 … xi and y1 … yj

Assume a linear gap penalty −d. The recursive relation is

$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) & \text{(match/mismatch)} \\ F(i-1,\,j) - d & \text{(gap in } y\text{)} \\ F(i,\,j-1) - d & \text{(gap in } x\text{)} \end{cases}$$

with F(0,0) = 0, F(i,0) = −id, F(0,j) = −jd.


Smith-Waterman (local alignment)

The local alignment algorithm is very similar to the global one; it requires only a slight modification.

Align two sequences x1 x2 … xn and y1 y2 … ym and create an m x n matrix F where

F(i, j) = score of optimal path of subsequence x1 … xi and y1 … yj

Assume a linear gap penalty −d. The recursive relation is

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,\,j-1) + s(x_i, y_j) & \text{(match/mismatch)} \\ F(i-1,\,j) - d & \text{(gap in } y\text{)} \\ F(i,\,j-1) - d & \text{(gap in } x\text{)} \end{cases}$$

with F(0,0) = 0, F(i,0) = 0, F(0,j) = 0.
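As an illustration of the tabular computation (a software sketch only, not the PHDPA hardware described later), the following C program fills the Needleman-Wunsch matrix with a linear gap penalty; the match/mismatch scores (+2/−1), the gap penalty d = 8, and the example sequences are assumptions chosen for this sketch.

```c
#include <stdio.h>
#include <string.h>

#define MAXLEN 128

/* Simple substitution score: +2 for a match, -1 for a mismatch (illustrative values). */
static int s(char a, char b) { return (a == b) ? 2 : -1; }

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

int main(void) {
    const char *x = "HEAGAWGHEE";   /* first sequence  */
    const char *y = "PAWHEAE";      /* second sequence */
    const int d = 8;                /* linear gap penalty (illustrative) */
    int n = (int)strlen(x), m = (int)strlen(y);
    static int F[MAXLEN + 1][MAXLEN + 1];

    /* Initialization: F(i,0) = -i*d, F(0,j) = -j*d. */
    for (int i = 0; i <= n; i++) F[i][0] = -i * d;
    for (int j = 0; j <= m; j++) F[0][j] = -j * d;

    /* Tabular computation of the recursive relation. */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            F[i][j] = max3(F[i-1][j-1] + s(x[i-1], y[j-1]),  /* match/mismatch */
                           F[i-1][j] - d,                     /* gap in y */
                           F[i][j-1] - d);                    /* gap in x */

    printf("optimal global alignment score = %d\n", F[n][m]);
    return 0;
}
```

The Smith-Waterman variant only adds 0 as a fourth argument of the maximum and initializes the first row and column to 0; the traceback component is omitted here.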

1.4 Genetic Algorithm for TSP and MSA

The computation of multiple sequence alignments (MSAs) is still one of the major open problems in computational biology. MSA is NP-complete, and much research has addressed the MSA problem; for example, Chantal Korostensky and Gaston Gonnet [18] propose a method for MSA calculation.

Their method, which solves a Traveling Salesman Problem (TSP) for the MSA, reduces the computation to O(k²n²) time, where k is the length of the longest input sequence. This is a very interesting solution: although both MSA and TSP are NP-complete, in practice the TSP can be solved near-optimally in many ways; for example, Genetic Algorithms (GA) [19] are a powerful tool for the TSP [20][21][22].

The GA has been applied to many hard optimization problems, but if the search space is large, a pure software implementation causes unacceptable delays. Related research on GA for TSP optimization concentrates mostly on generating a good initial population and on crossover and mutation methods. We hope to improve the execution efficiency of genetic algorithms by exploring a hardware implementation.

Based on this research motivation and background, we realize that when the GA is applied to huge amounts of data, speeding it up becomes very important, and a hardware implementation of the GA can be very useful in accelerating the optimization process. Moreover, parallel and pipelined operation can be implemented very easily in hardware. For this purpose, this thesis tries to use a hardware implementation to greatly improve the execution efficiency of the GA, so that the GA solution of the TSP can cope with large amounts of data, such as MSAs.

1.5 Dissertation Purpose and Organization

This dissertation is organized as follows. Chapter 2 introduces the concept of a bioinformatic machine. Chapter 3 implements the ROPU hardware module for an ORDBM. Chapter 4 implements the parallel hardware of dynamic programming alignment. Finally, conclusions and future work are given in Chapter 5.


Chapter 2

The Concept of a Bioinformatic Machine

Our concept of a bioinformatic machine deals with huge amounts of biological sequence data through special hardware designed on a PC cluster. Huge biological sequence data create two bottlenecks: first, the files or databases are so large that computer I/O becomes a burst load; second, computing on the given biological sequences can be very time-consuming, depending on the algorithm. We therefore focus on hardware solutions for database mining and sequence alignment performance.

Section 2.1 proposes the bioinformatic machine. Section 2.2 presents the concepts of the ORDBM hardware design and of the m-1 way virtual search tree software. Section 2.3 presents the concepts of the PHDPA and SHGA hardware.

2.1 Propose a Bioinformatic Machine

The concept of a bioinformatic machine has four major aspects: hardware implementation of algorithms, parallel computation, data mining, and sequence alignment.

Because bioinformatic algorithms are computationally demanding, we use hardware to accelerate them. Further, the hardware-implemented algorithms are integrated into a PC cluster in order to parallelize the computation. The PC cluster uses Linux as the operating system, PVM or MPI as the parallel programming environment, and PostgreSQL as the object relational database; all of these are free open source software. Figure 2.1 gives an overview of our bioinformatic machine based on a PC cluster. Our special hardware on the PC cluster consists of the ROPU processor, the search algorithm (SA), dynamic programming (DP), and the genetic algorithm (GA) modules.

Figure 2.1 Overview of our bioinformatic machine based on a PC cluster.

Database mining is separated into two parts, hardware and software. On the hardware side, we design an Object Relational DataBase Machine (ORDBM) built on a PC cluster with four nodes. On the software side, we develop a search algorithm that assumes the PC cluster data are already sorted and distributed. Altogether, data sorting and even data distribution are handled first by the hardware, and the data search is then handled by the software.

Sequence alignment is differentiated into pairwise sequence alignment and multiple sequence alignment. The optimal pairwise alignment can be found with dynamic programming, but the tabular computation is time-consuming. For this reason, we develop parallel dynamic programming and design it as special hardware. Finding a multiple sequence alignment is NP-complete.

Therefore, we solve multiple sequence alignment with a traveling salesman problem approach and then use a genetic algorithm to find the traveling salesman path. We design the crossover and mutation operators in hardware.

2.2 Database Mining

The size of the GenBank/EMBL/DDBJ nucleotide database is now doubling every 15 months, and sequence database searching is among the most important and challenging tasks in bioinformatics. Therefore, we develop four database mining algorithms and a search algorithm.

2.2.1 Object Relational DataBase Machine

We design a four-way extended sorter based on a bitonic network. In addition to sorting, the four-way extended sorter performs data distribution: it distributes the data evenly over the four nodes of the PC cluster. We then use the four-way extended sorter as the main kernel to define the four algorithms Ugsel, Mgsel, Gadd, and Match, which can execute all of the relational algebra and aggregate function operators listed in Table 2.1. We use the VHDL hardware description language for simulation; these four algorithms are explained in detail in Chapter 3.
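The four-way extended sorter itself is a hardware design and is not reproduced here; as a point of reference for the underlying sorting principle only, a bitonic sort can be sketched in software as follows (our own illustration; the array length is assumed to be a power of two and the key values are arbitrary).

```c
#include <stdio.h>

/* Compare-and-swap the element pair (i, j); dir = 1 keeps ascending order, dir = 0 descending. */
static void compare_swap(int a[], int i, int j, int dir) {
    if ((a[i] > a[j]) == dir) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Merge a bitonic subsequence of length n starting at lo into the order given by dir. */
static void bitonic_merge(int a[], int lo, int n, int dir) {
    if (n <= 1) return;
    int k = n / 2;
    for (int i = lo; i < lo + k; i++) compare_swap(a, i, i + k, dir);
    bitonic_merge(a, lo, k, dir);
    bitonic_merge(a, lo + k, k, dir);
}

/* Sort by building a bitonic sequence (ascending half + descending half) and merging it. */
static void bitonic_sort(int a[], int lo, int n, int dir) {
    if (n <= 1) return;
    int k = n / 2;
    bitonic_sort(a, lo, k, 1);
    bitonic_sort(a, lo + k, k, 0);
    bitonic_merge(a, lo, n, dir);
}

int main(void) {
    int keys[16] = { 9, 3, 14, 1, 7, 12, 0, 5, 11, 2, 15, 8, 4, 13, 6, 10 };
    bitonic_sort(keys, 0, 16, 1);
    for (int i = 0; i < 16; i++) printf("%d ", keys[i]);
    printf("\n");
    return 0;
}
```

All compare-and-swap operations at the same depth are independent of each other, which is why this kind of network maps naturally onto parallel hardware.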

Relational operation                       Algorithm(s)
Union, Intersection, Difference            Ugsel, Mgsel
Cartesian Product                          Cluster
Selection                                  Cluster
Projection                                 Ugsel, Cluster
Equi Join                                  Mgsel, Match, Gadd
Nonequi Join                               Gadd, Cluster
Division                                   Mgsel, Match, Gadd
Min, Max over a group or groups            Ugsel, Mgsel
Sum, Count, AVE over a group or groups     Gadd

Table 2.1 Ugsel, Mgsel, Gadd, Match execute all of the relational algebra and aggregate functions.

The prototype design of the parallel object relational database machine is shown in Figure 2.2. Our ORDBM consists of the MPU and the ROPU processors. The MPU is a PC cluster of four x86 processors running embedded Linux. The ROPU is a massive processor array containing 1376 bit-sliced processors, each of which requires 60 logic gates. The ROPU can process sorting and the four algorithms Ugsel, Mgsel, Gadd, and Match. Together, the MPU and the ROPU can execute all of the relational algebra and aggregate function operators in Table 2.1.

Figure 2.2 The prototype of the ORDBM.

The SWITCH component checks the valid bit. If a data segment is in order, it is output; otherwise it is fed back to the F cache and processed again together with the input data segments. In this way, the data become perfectly distributed.

2.2.2 m-1 Way Virtual Search Tree Algorithm

After the hardware processing of Section 2.2.1, we develop a search algorithm that assumes the PC cluster data are already sorted and distributed.

Suppose we have the following PC cluster with five hosts (p1–p5), as shown in Figure 2.3. There are 64 records of ordered data equally distributed over the cluster farm.


Figure 2.3 Ordered data distributed in cluster server farm.

By keeping the partial data in the local memory of the clustered hosts, our system forms a virtual search tree, as Figure 2.4 shows. Whenever we search for a datum, every host simply looks up the array of partial data kept in its local memory. In this way we keep the B+-tree search efficiency while avoiding the system overhead of maintaining the tree data structure described before.

Figure 2.4 m-1 way virtual search tree (when m=5).

We now show how the m-1 way virtual search tree is built. As Figure 2.5 shows, the m-1 way virtual search tree is first built upon the leaf nodes, like the node split process in a B+-tree.

level 1: 16, 32, 48, 64
level 2: 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64
level 3: 1, 2, 3, …, 64

Figure 2.5 m-1 way virtual search tree building process.
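The search itself can be sketched in software as follows (our own illustration, not the CPSS implementation of [16]): the ordered keys 1–64 are spread round-robin over five hosts as in Figure 2.3, every host keeps its partition as a sorted in-memory array, and each host answers a query with a plain binary search over its own partition.

```c
#include <stdio.h>

#define HOSTS 5
#define PER_HOST 13   /* 64 records spread over 5 hosts: at most 13 per host */

/* Binary search in one host's local sorted partition.
   Returns the local index of key, or -1 if the key is not stored on this host. */
static int local_search(const int part[], int count, int key) {
    int lo = 0, hi = count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (part[mid] == key) return mid;
        if (part[mid] < key) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

int main(void) {
    /* Round-robin distribution of the ordered keys 1..64 over five hosts,
       mimicking the layout of Figure 2.3 (key k goes to host ((k-1) mod 5) + 1). */
    int part[HOSTS][PER_HOST];
    int count[HOSTS] = { 0 };
    for (int key = 1; key <= 64; key++) {
        int h = (key - 1) % HOSTS;
        part[h][count[h]++] = key;
    }

    int key = 37;
    for (int h = 0; h < HOSTS; h++) {        /* in the cluster, every host searches in parallel */
        int idx = local_search(part[h], count[h], key);
        if (idx >= 0)
            printf("key %d found on host p%d at local position %d\n", key, h + 1, idx);
    }
    return 0;
}
```

Because every partition stays sorted in memory, no B+-tree pages, pointers or locks have to be maintained, which is how the index maintenance overhead is avoided while the search efficiency is kept.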

2.3 Sequence Alignment

Sequence alignment is useful for discovering function and structure and for understanding evolution. Sequences that are very similar probably have a similar function and/or structure. Dynamic Programming (DP) is an efficient recursive method that searches all possible alignments and finds the one with the optimal score.

2.3.1 Hardware of Dynamic Programming Alignment

We design special-purpose hardware, called the PHDPA, to remove the time-consuming tabular computation of dynamic programming [23][24][25]. After the design, we embed the PHDPA into a PCI card so that it can be attached to a general-purpose computer through the PCI interface.

The PHDPA contains five components: Sequence Flow, Scoring Matrix, Multiple Processor Units, Storage, and Control Unit. The Sequence Flow controls the timing with which the database sequences and query sequences enter the Multiple Processor Units. The Scoring Matrix can store any of the PAM or BLOSUM matrices and is a one-dimensional CREW PRAM with four data buses and four address buses. The Multiple Processor Units, the major component of the PHDPA, contain four dedicated processors that generate the tabular computation. The Storage saves the tabular computation results from the Multiple Processor Units. The Control Unit controls when the Sequence Flow, Multiple Processor Units, and Storage run, stop, or interact with each other. The PHDPA components are explained in detail in Chapter 4.

The general-purpose computer has two major tasks. The first is to load database sequences into main memory; these sequences are then transferred to the Sequence Flow component of the PHDPA through the PCI interface. The second is to transfer the scores from the PHDPA Storage back into the main memory of the general-purpose computer through the PCI interface, so that programs can access the scores in order to perform the traceback.

Our PHDPA architecture is applied on a PC cluster. Every node is a Linux-based PC with a PCI card carrying a dedicated PHDPA. The server host is also a Linux-based PC, but it has no PHDPA PCI card. The server host uses PVM or MPI, a parallel programming environment, to program every node to start its PHDPA. As an example, the server host can command a four-node PC cluster with four dedicated PHDPAs.

2.3.2 Hardware of Genetic Algorithm for TSP and MSA

Multiple sequence similarities suggest a common structure of the protein products, a common function, or a common evolutionary source [26]. The best MSA can in principle be computed by dynamic programming, but for k sequences of length n the space complexity is O(n^k) and the time complexity is O(2^k n^k); computing MSAs is NP-complete [27].

Therefore, [18] presents a new algorithm that calculates a near-optimal MSA with a performance guarantee of $\frac{n-1}{n}\cdot opt$ (where opt is the optimal score of the MSA), and the algorithm runs in O(k²n²) time, where k is the length of the longest input sequence.

The algorithm is based on a TSP approach and has the advantage that no evolutionary tree must be constructed [28]. The idea is to find a circular tour without knowing the tree structure. A circular order of an MSA is any tour through a tree in which each edge is traversed exactly twice and each leaf is visited once (see Figure 2.6); such a tour can be calculated very efficiently with a TSP algorithm. A modified version of the TSP is used: the cities correspond to the sequences, the distances are the scores of the optimal pairwise alignments, and the goal is to find the longest tour in which each sequence is visited exactly once. The score used is the sum-of-pairs score in circular order.

Figure 2.6 Traversal of a tree in circular order.

(30)

We will partition the GA system into software dominated partition and hardware dominated partition, as shown in Figure 2.7. The software dominated partition includes embedded system and part operation of GA, such as initial population, select, and terminal. The main hardware dominated partition includes crossover unit, mutation unit, and fitness unit.

Figure 2.7 Hardware/Software Partitioning.

Our SHGA hardware includes eleven units, as shown in Figure 2.8. In this design the crossover function and the mutation function operate in parallel, and so do the two fitness functions. The overall operation of the SHGA system is pipelined, as shown in Figure 2.9.

Figure 2.8 The SHGA block diagram.

Figure 2.9 Pipelined and parallel configurations model (for each tour, the pipeline stages are: receive data from the embedded system (P1, P2); crossover; mutation; fitness (E1) and fitness (E2); send data to the embedded system (E1, E2)).


Chapter 3

Implement ROPU Hardware Module for an ORDBM

Section 3.1 defines the four algorithms Ugsel, Mgsel, Gadd, and Match. Section 3.2 designs the ROPU and its interface I/O Cache. Section 3.3 presents our simulation results and a performance analysis.

3.1 Definition of Four Algorithms

3.1.1 Ugsel and Mgsel algorithms

DEFINITION 1 (Definition of Ugsel and Mgsel)

Let d_i be one of the distinct values of attribute A_g, and assume that T_n, T_{n+1}, …, T_{n+m_i} are the only tuples in which A_g takes the value d_i. Further assume that the value of the comparison attribute in T_{n+j} is s_{ij}, and let s_i = MAX(s_{i0}, s_{i1}, …, s_{im_i}).

(a) In Ugsel (unique generalized selection), if s_{ij} = s_i then T_{n+j} is a qualified tuple; all other tuples with the attribute value d_i are rejected.

(b) In Mgsel (multiple generalized selection), a tuple T_{n+j}, 0 ≤ j ≤ m_i, is a qualified tuple only when s_{ij} = s_i.

Example

A Table                          B Table
---                              ---
S#  SNAME  STATUS  CITY          S#  SNAME  STATUS  CITY
---                              ---
S1  Smith  20      London        S1  Smith  20      London
S2  Jones  10      Paris         S4  Clark  20      London
---                              ---

The input data segment is

---
Ag               As  Mb
---
S1Smith20London  S1  1
S1Smith20London  S1  1
S2Jones10Paris   S1  1
S4Clark20London  S1  1
---

After Ugsel algorithm processing:

---
Ag               As  Mb
---
S1Smith20London  S1  0
S1Smith20London  S1  1
S2Jones10Paris   S1  1
S4Clark20London  S1  1
---

Deleting the duplicated tuples (those with Mb = 0) gives the union operation result:

---
S#  SNAME  STATUS  CITY
---
S1  Smith  20      London
S2  Jones  10      Paris
S4  Clark  20      London
---

Ugsel processing can also implement the intersection operation. Choosing the tuples with Mb = 0 gives the intersection result:

---
S#  SNAME  STATUS  CITY
---
S1  Smith  20      London
---


3.1.2 Gadd algorithm

DEFINITION 2 (Definition of Gadd)

In Gadd (generalized addition operation), let e_{ij} be the value of the addition attribute A_a in tuple T_{n+j}, 0 ≤ j ≤ m_i. In the generalized addition operation:

(a) For every group of tuples T_n, …, T_{n+m_i}, accept one of the tuples (say T_{n+k}) by setting its mark bit and clearing the mark bits of the other tuples.

(b) The addition attribute of T_{n+k} becomes e_i = SUM(e_{i0}, e_{i1}, …, e_{im_i}).

Example

Query: calculate average salary in each department in table A.

A Table
---
Dep  Salary
---
D3   20000
D1   20000
D2   30000
D3   40000
---

Table A is transformed into the input data segment

---
Ag  AD1    AD2  Mb
---
D3  20000  1    1
D1  20000  1    1
D2  30000  1    1
D3  40000  1    1
---

After Gadd algorithm processing, the result (choosing the tuples with Mb = 1) is

---
Ag  AD1    AD2  Mb
---
D1  20000  1    1
D2  30000  1    1
D3  00000  0    0
D3  60000  2    1
---

The average salary of each department is then AD1 divided by AD2.

3.1.3 Match algorithm

DEFINITION 3 (Definition of Match)

In Match (matching operation), let d_i be one of the distinct values of attribute A_g, and assume that T_n, …, T_{n+r} are tuples in which A_g takes the value d_i and that these tuples belong to one of the two augmented relations.

(a) If there are also tuples T_m, T_{m+1}, …, T_{m+s} in which A_g takes the value d_i and these tuples belong to the other augmented relation, then T_n, T_{n+1}, …, T_{n+r} and T_m, T_{m+1}, …, T_{m+s} are all qualified tuples.

(b) If T_n, …, T_{n+r} are the only tuples in which A_g takes the value d_i, i.e., d_i does not occur among T_m, T_{m+1}, …, T_{m+s}, then these tuples are not qualified.
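The Match semantics can be mimicked in software as follows (our own sketch, not the ROPU circuit); the tuples, their values, and the relation tags 'R' and 'S' are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* One augmented tuple: the group attribute Ag plus a tag saying which relation it came from. */
struct tuple { const char *ag; char rel; /* 'R' or 'S' */ int qualified; };

int main(void) {
    struct tuple t[] = {
        { "S1", 'R', 0 }, { "S2", 'R', 0 },  /* tuples from the first relation  */
        { "S1", 'S', 0 }, { "S4", 'S', 0 },  /* tuples from the second relation */
    };
    int n = (int)(sizeof t / sizeof t[0]);

    /* A tuple is qualified when its Ag value also occurs in a tuple of the other relation. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && t[i].rel != t[j].rel && strcmp(t[i].ag, t[j].ag) == 0)
                t[i].qualified = 1;

    for (int i = 0; i < n; i++)
        printf("%s (%c): %s\n", t[i].ag, t[i].rel, t[i].qualified ? "qualified" : "rejected");
    return 0;
}
```

Only the S1 tuples end up qualified, which is the matching step behind the equi-join of Table 2.1.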

3.2 Design ROPU and Its Interface I/O Cache

3.2.1 I/O Cache

The I/O Cache can receive data from the MPU, store them temporarily, and then send them to the sorter. It can also receive data from the SWITCH, store them temporarily, and output them to the MPU. The I/O Cache includes an input port and an output port, as shown in Figure 3.1.

The I/O Cache connects to the control bus and provides communication among the cluster, the host, and the ROPU. The I/O Cache has a shift register that moves data in and out of the ROPU 16 bits at a time.

Figure 3.1 I/O Cache.

3.2.2 Microcode Generator

The MPU controls the Microcode Generator to generate a microcode sequence. The sorter and merger process the input data according to this microcode sequence.

"I1" clears all flags and is always used as the first microcode of a microcode sequence. "I2" compares the input data, sets the flags fc1 and fc2, and outputs the data in ascending order. "I3" also compares the input data, but outputs them in descending order. "I4" stores the result of comparing the group attributes. "I5", "I6", "I7" and "I8" manipulate the mark bits for our parallel algorithms. The microcodes are defined in Table 3.1, and the Microcode Generator block diagram is shown in Figure 3.2.


Table 3.1 The instruction set table of microcode operations.

Figure 3.2 Microcode block diagram.


3.2.3 ROPU

The ROPU is a massive processor array whose bit-sliced processors are specially designed. Each processor contains a Bit Comparator, a Bit Manipulator, a Bit Adder, a Function Selector, and a SWITCH. Figure 3.3 shows the overall architecture of a ROPU bit-sliced processor.

Figure 3.3 The architecture of ROPU processor element.

The main functions of the sorter and merger are described below:

˙Bit Adder:

The two input tuples are first compared in the Bit Comparator, which then sends the flag fc1 to the Bit Adder. The flag fc1 = 0 means that input tuple A is equal to B; otherwise A is not equal to B. Microcode I8 and the flag fc1 decide whether the Bit Adder needs to execute an addition operation or not.

˙Bit Comparator:

The Bit Comparator compares a series of input data bits. The comparison result is sent to the other modules for bit addition and bit manipulation.

˙Bit Manipulator:

The Bit Manipulator manipulates the two input mark bits according to their microcodes. The relevant microcodes and flags are I5, I6, I7, fmb and fc2; the Bit Manipulator manipulates the mark bits according to the input microcodes and flags.

˙Function Selector:

The Function Selector selects data bits from the Bit Adder, Bit Comparator or Bit Manipulator according to the input microcodes I2–I8.

˙SWITCH:

The pattern of the SWITCH is decided by the input data bits or mark bit, which are sent from the Bit Adder, Bit Comparator and Bit Manipulator and pass through the Function Selector.

˙Microcode Register:

The microcode sequence is piped through the ROPU processor elements from one stage to another. Microcode registers are used to store this pipelined microcode sequence.

˙Multiplexer:

The Multiplexer is controlled by I1 and LB to decide whether the data are processed in this processor or not.

3.2.4 Edge Tuple Detector

This device compares the edge tuples of two data segments and outputs the comparison result to the MPU. The MPU uses this result to decide which data segment to input next to the I/O Cache while the two-way merge algorithm is being performed. Figure 3.4 shows the algorithm by which the Edge Tuple Detector compares the edge tuples of two data segments.


Figure 3.4 Edge Tuple Detector compare 2 data segment edge tuple algorithm.

3.2.5 Backend Controller

The Backend Controller controls the Microcode Generator to generate the microcode sequence, and the ROPU processes data according to this microcode sequence. Figure 3.5 shows the Backend Controller block diagram.

Figure 3.5 Backend Controller block diagram.

3.3 Simulate Result and Performance Analysis

3.3.1 Simulate Result

First, the host must initialize the Microcode Generator and generate a series of microcodes. The ROPU and the PC cluster can then execute all of the relational algebra and aggregate function operators in Table 3.2.

Relational operation                       Algorithm(s)
Union, Intersection, Difference            Ugsel, Mgsel
Cartesian Product                          Cluster
Selection                                  Cluster
Projection                                 Ugsel, Cluster
Equi Join                                  Mgsel, Match, Gadd
Nonequi Join                               Gadd, Cluster
Division                                   Mgsel, Match, Gadd
Min, Max over a group or groups            Ugsel, Mgsel
Sum, Count, AVE over a group or groups     Gadd

Table 3.2 Ugsel, Mgsel, Gadd, Match execute all of the relational algebra and aggregate functions.

Example 1: Ugsel and Mgsel algorithms.

Example 2: Gadd algorithm

Microinstruction: I1, I2, I2, I8, I8, I5

Example 3: Division algorithm, executed in two passes.

First pass:

Microinstructions: I1, I2, I2, I2, I2, I2, I6, I4, I4

Second pass:

Microinstructions: I1, I2, I2, I8, I5

Example 4: Equi-join algorithm.

Microinstructions: I1, I2, I2, I2, I7, I7, I2, I6, I2, I2, I4, I4

3.3.2 Performance Analysis

Assume that the sorting algorithm processes database records and record operations, and that there are n records per table. The time complexity is:

1. Workstation or single PC:
   n × (log n) × (retrieval time + manipulation time)

2. PC cluster with k nodes: far greater than
   (n / k) × (log n) × (retrieval time + manipulation time),
   because high communication and software costs are also incurred.

3. Our ORDBM, assuming that the retrieval time dominates the manipulation time and that every processor can process 16 records, is approximately:
   (n / k) × [log n / (16 × k)] × (retrieval time)
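These estimates are easy to compare numerically; the record count, node count and per-record times below are hypothetical values, and a base-2 logarithm is assumed.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double n  = 1.0e6;  /* records per table (hypothetical)            */
    double k  = 4.0;    /* cluster nodes                               */
    double tr = 1.0;    /* retrieval time per record, arbitrary units  */
    double tm = 0.5;    /* manipulation time per record, arbitrary     */

    double single  = n * log2(n) * (tr + tm);                /* workstation / single PC */
    double cluster = (n / k) * log2(n) * (tr + tm);          /* PC cluster, lower bound */
    double ordbm   = (n / k) * (log2(n) / (16.0 * k)) * tr;  /* our ORDBM estimate      */

    printf("single PC : %.3e\n", single);
    printf("PC cluster: %.3e (plus communication and software costs)\n", cluster);
    printf("ORDBM     : %.3e\n", ordbm);
    return 0;
}
```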


Chapter 4

Parallel Hardware of Dynamic Programming Alignment

In Section 4.1, our PHDPA architecture is implemented through a PCI interface card, and dedicated PHDPAs are then applied on the PC cluster. Section 4.2 defines the data representation of the 20 kinds of amino acids and of the scoring matrix. Section 4.3 designs the PHDPA components. Section 4.4 analyzes the time complexity.

4.1 Overview Our PHDPA Architecture

4.1.1 Our PHDPA Architecture

We design special-purpose hardware, called the PHDPA, to remove the time-consuming tabular computation of dynamic programming. After the design, we embed the PHDPA into a PCI card. Figure 4.1 illustrates the PHDPA on a general-purpose computer connected through the PCI interface.

The PHDPA contains five components: Sequence Flow, Scoring Matrix, Multiple Processor Units, Storage, and Control Unit. The Sequence Flow controls the timing with which the database sequences and query sequences enter the Multiple Processor Units. The Scoring Matrix can store any of the PAM or BLOSUM matrices and is a one-dimensional CREW PRAM with four data buses and four address buses. The Multiple Processor Units, the major component of the PHDPA, contain four dedicated processors that generate the tabular computation. The Storage saves the tabular computation results from the Multiple Processor Units. The Control Unit controls when the Sequence Flow, Multiple Processor Units, and Storage run, stop, or interact with each other.

Figure 4.1 The PHDPA (right dotted rectangle) on the general-purpose computer (left dotted rectangle).

The general-purpose computer has two major tasks. The first is to load database sequences into main memory; these sequences are then transferred to the Sequence Flow component of the PHDPA through the PCI interface. The second is to transfer the scores from the PHDPA Storage back into the main memory of the general-purpose computer through the PCI interface, so that programs can access the scores in order to perform the traceback.

4.1.2 Dedicated PHDPAs on PC cluster

Our PHDPA architecture is applied on a PC cluster. Every node is a Linux-based PC with a PCI card carrying a dedicated PHDPA. The server host is also a Linux-based PC, but it has no PHDPA PCI card. The server host uses PVM or MPI, a parallel programming environment, to program every node to start its PHDPA. As an example, Figure 4.2 illustrates the server host commanding a four-node PC cluster with four dedicated PHDPAs.

Figure 4.2 The server host commands a four-node PC cluster with four dedicated PHDPAs.

4.2 Data Representation of the PHDPA

There are as many as 100 thousand kinds of proteins in the body, and they are built from only 20 kinds of amino acids in various combinations; these 20 kinds of amino acids are essential to the body. Therefore, we use a five-bit code to represent the 20 kinds of amino acids. Table 4.1 shows the one- and three-letter amino acid codes and the five-bit representation.

Name             1-Letter code   3-Letter code   Five-bit representation

Alanine A Ala 00000

Arginine R Arg 00001

Asparagine N Asn 00010

Aspartic acid D Asp 00011

Cysteine C Cys 00100

Glutamine Q Gln 00101

Glutamic acid E Glu 00110

Glycine G Gly 00111

Histidine H His 01000

Isoleucine I Ile 01001

Leucine L Leu 01010

Lysine K Lys 01011

Methionine M Met 01100

Phenylalanine F Phe 01101

Proline P Pro 01110

Serine S Ser 01111

Threonine T Thr 10000

Tryptophan W Trp 10001

Valine V Val 10010

Tyrosine Y Tyr 10011

Table 4.1 One- and three-letter amino acid codes and five bits representation.

In the scoring matrix, for example PAM250, the maximum score is 17, for amino acid W aligned with amino acid W, and the minimum score is −8, for amino acid W aligned with amino acid C. The scores therefore range from −8 to 17, and we represent them with six bits in 2's complement. Table 4.2 shows the six-bit 2's complement codes and the corresponding decimal digits.

Decimal digit Positive Negative Decimal digit

0 000000

1 000001 111111 -1

2 000010 111110 -2

3 000011 111101 -3

4 000100 111100 -4

5 000101 111011 -5

6 000110 111010 -6

7 000111 111001 -7

8 001000 111000 -8

9 001001 110111 -9 not used

10 001010 110110 -10 not used
11 001011 110101 -11 not used
12 001100 110100 -12 not used
13 001101 110011 -13 not used
14 001110 110010 -14 not used
15 001111 110001 -15 not used
16 010000 110000 -16 not used
17 010001 101111 -17 not used
18 not used 010010 101110 -18 not used
19 not used 010011 101101 -19 not used
20 not used 010100 101100 -20 not used
21 not used 010101 101011 -21 not used
22 not used 010110 101010 -22 not used
23 not used 010111 101001 -23 not used
24 not used 011000 101000 -24 not used
25 not used 011001 100111 -25 not used
26 not used 011010 100110 -26 not used
27 not used 011011 100101 -27 not used
28 not used 011100 100100 -28 not used
29 not used 011101 100011 -29 not used
30 not used 011110 100010 -30 not used
31 not used 011111 100001 -31 not used

Table 4.2 Six bits 2’s complement and decimal digit.
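These two encodings can be checked with a short program (our own sketch; the five-bit order follows Table 4.1 and the six-bit score code is plain 2's complement).

```c
#include <stdio.h>
#include <string.h>

/* One-letter amino acid codes in the order of Table 4.1; the index is the five-bit code. */
static const char AA[21] = "ARNDCQEGHILKMFPSTWVY";

/* Five-bit code of an amino acid letter, or -1 if the letter is not one of the 20. */
static int aa_code(char c) {
    const char *p = strchr(AA, c);
    return p ? (int)(p - AA) : -1;
}

/* Six-bit 2's complement encoding of a score in the range -8..17. */
static unsigned score_code(int s) { return (unsigned)s & 0x3F; }

static void print_bits(unsigned v, int bits) {
    for (int b = bits - 1; b >= 0; b--) putchar((v >> b) & 1 ? '1' : '0');
}

int main(void) {
    const char *seq = "HEAGAWGHEE";
    for (int i = 0; seq[i]; i++) {
        int c = aa_code(seq[i]);
        printf("%c -> code %2d = ", seq[i], c);
        print_bits((unsigned)c, 5);
        putchar('\n');
    }
    /* PAM250 extreme scores: W aligned with W = 17, W aligned with C = -8. */
    int scores[2] = { 17, -8 };
    for (int i = 0; i < 2; i++) {
        printf("score %3d -> ", scores[i]);
        print_bits(score_code(scores[i]), 6);
        putchar('\n');
    }
    return 0;
}
```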


4.3 Design the PHDPA Components

For example, consider an amino acid sequence alignment that aligns the database sequence CEQL and the query sequence WEZVY.

The resulting tabular computation for this example (rows: W E Z V Y; columns: C E Q L) is

      C   E   Q   L
  W  -8  -7  -5  -2
  E  -5  -4  -4  -2
  Z  -5  -2  -1  -1
  V  -2  -2  -1   1
  Y  -2  -2  -1   1

Our PHDPA uses a SIMD approach to parallelize the tabular computation with four dedicated processors labeled p0, p1, p2 and p3. Figure 4.3 shows the overall processing for T = 1–8.

Figure 4.3 Overall processing for T = 1–8.


4.3.1 Multiple Processor Units

Using the SIMD approach, the Multiple Processor Units contain four dedicated processors that parallelize the tabular computation, and the amino acids are pipelined into the processors. When T = 1, C and W are input to p0. When T = 2, W is pipelined to p1, E is input to p1, and E is input to p0. When T = 3, W is pipelined to p2, E is pipelined to p1, Q is input to p2, and Z is input to p0. When T = 4, W is pipelined to p3, E is pipelined to p2, Z is pipelined to p1, L is input to p3, and V is input to p0. When T = 5, E is pipelined to p3, Z is pipelined to p2, V is pipelined to p1, and Y is input to p0. When T = 6, Z is pipelined to p3, V is pipelined to p2, and Y is pipelined to p1. When T = 7, V is pipelined to p3 and Y is pipelined to p2. When T = 8, Y is pipelined to p3. Figure 4.4 shows the processing for T = 1–8.
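This schedule is an anti-diagonal wavefront: each processor pk holds one character of the CEQL strip, and the WEZVY characters stream past it with a one-step delay per processor. The following sketch (our own illustration of the scheduling only, not of the hardware dataflow) prints which character pair each processor works on at each step T.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *strip  = "CEQL";    /* one character of this strip per processor p0..p3 */
    const char *stream = "WEZVY";   /* characters streamed through the processor array  */
    int p = (int)strlen(strip);     /* number of processors (4)                          */
    int m = (int)strlen(stream);    /* streamed sequence length (5)                      */

    /* At time T, processor pk sees stream[T-1-k] together with its own strip[k]. */
    for (int T = 1; T <= m + p - 1; T++) {
        printf("T=%d:", T);
        for (int k = 0; k < p; k++) {
            int q = T - 1 - k;
            if (q >= 0 && q < m)
                printf("  p%d:(%c,%c)", k, stream[q], strip[k]);
        }
        printf("\n");
    }
    return 0;
}
```

The total number of steps is m + p − 1 = 8, which matches the T = 1–8 processing of Figures 4.3 and 4.4 and, for one four-character strip, the cycle-count formula of Section 4.4.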

Figure 4.4 The processing for T = 1–8.

After an amino acid is input, every processor must calculate an address simultaneously in order to look up the score value in the Scoring Matrix. Therefore, every processor has its own address bus, driven by the input amino acids, and its own data bus, driven by that address, as shown in Figure 4.5.

Figure 4.5 Address bus according to input amino acid and data bus according to address.

We explain how the processing timing is controlled in the next section (Sequence Flow), how the address is calculated in Section 4.3.3 (Scoring Matrix), and the dedicated processor in detail in Section 4.3.5.

4.3.2 Sequence Flow

The Sequence Flow controls the timing with which the database sequences and query sequences enter the Multiple Processor Units. As shown in Figure 4.6, the Sequence Flow consists of two caches, two shift registers, a demultiplexer, a counter and a start-shift signal. The Sequence Flow is controlled by the Control Unit, so the input signal that starts the shift registers comes from the Control Unit.

Figure 4.6 Path (a) sequences from the database, path (b) sequences from the host query sequence.

4.3.3 Scoring Matrix

The Scoring Matrix can store any of the PAM or BLOSUM matrices and is a one-dimensional CREW PRAM with four data buses and four address buses.

Every PAM or BLOSUM matrix is symmetric. Therefore, we only save the lower-triangle scores in row-major order, as shown in Figure 4.7. For a given row index i and column index j, the one-dimensional address A(i, j) is given by

$$A(i,j) = \sum_{k=0}^{i} k + j = \frac{i(i+1)}{2} + j, \qquad i \ge j$$

A(i, j) is the address of the lower-triangle entry in the one-dimensional RAM.

Figure 4.7 Lower triangle in row major order.

Table 4.3 shows PAM250; we save its lower-triangle scores into the one-dimensional Scoring Matrix in row-major order.

Table 4.3 PAM250, lower triangle score in row major order.
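The address formula can be checked with a few lines of code (our own sketch; the indices use the five-bit codes of Table 4.1, so W = 17 and C = 4).

```c
#include <stdio.h>

/* Row-major address of element (i, j) of a lower-triangular matrix, i >= j. */
static int tri_address(int i, int j) { return i * (i + 1) / 2 + j; }

int main(void) {
    /* The PAM250 score for (W, C) is stored at tri_address(17, 4). */
    printf("A(17,4) = %d\n", tri_address(17, 4));   /* 17*18/2 + 4 = 157 */
    printf("A(0,0) = %d, A(1,0) = %d, A(1,1) = %d\n",
           tri_address(0, 0), tri_address(1, 0), tri_address(1, 1));
    return 0;
}
```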

Because the four processors work simultaneously, the one-dimensional Scoring Matrix must be a Concurrent Read, Exclusive Write (CREW) PRAM. Therefore, we design the CREW PRAM with four multiplexers, as shown in Figure 4.8.

Figure 4.8 The CREW PRAM is designed with four multiplexers.

4.3.4 Storage

The Storage saves the tabular computation results from the Multiple Processor Units. As shown in Figure 4.9, the Storage consists of four RAMs, four demultiplexers, four counters, three buffers and a start-save signal. The Storage is controlled by the Control Unit, so the input signal start save, which starts the counters, comes from the Control Unit.

Figure 4.9 Inputs s0, s1, s2, s3 are from Multiple Processor Units, Input start save is from Control Unit.

Figure 4.10 shows the four processors calculating scores and saving them into the Storage.

Figure 4.10 Dotted rectangle is Storage component.


4.3.5 Processor Unit

The Processor Unit is a dedicated processor that executes the dynamic programming alignment. The Processor Unit consists of an address calculator, two maximum comparators, three adders, nine registers, two multiplexers and an extend6to16 unit. Figure 4.11 shows our dedicated processor.

Figure 4.11 Our dedicated processor.

4.3.6 Control Unit

The Control Unit controls when the Sequence Flow, Multiple Processor Units and Storage run, stop, or interact with each other. The signal Count counts the shift register. Setting the signal Read = 1 reads a sequence from the cache memory. Setting the signal Shift = 1 starts shifting the sequences for the dynamic programming. Srfqr = "11111" indicates the end of the query sequence, and Srfdb = "11111" indicates the end of the database sequence. Figure 4.12 shows the Control Unit state diagram.

Figure 4.12 Control Unit state diagram.

Finally, the PHDPA is simulated in VHDL (see Appendix A). As an example, we align the database sequence WEZVY with the query sequence CEQLRZIF.

The query sequence is processed four characters at a time. The first pass covers the query characters C E Q L (rows: database sequence W E Z V Y):

      C   E   Q   L
  W  -8  -7  -5  -2
  E  -5  -4  -4  -2
  Z  -5  -2  -1  -1
  V  -2  -2  -1   1
  Y  -2  -2  -1   1

After rotation, the second pass covers the query characters R Z I F:

      R   Z   I   F
  W   2   2   2   2
  E   2   5   5   5
  Z   2   5   5   5
  V   2   5   9   9
  Y   2   5   9  16

4.4 Performance Analysis

We compare the total clock cycles of our PHDPA with the total clock cycles of a dynamic programming alignment (DPA) assembly program on a Pentium x86 CPU, and show the speedup in Table 4.4. For a given query sequence length m and database sequence length n, the total clock cycles of the DPA assembly program are

$$\left[\,113 \times n + 7\,\right] \times m$$

For a given number of processors p in the Multiple Processor Units, the total clock cycles of our PHDPA are

$$\frac{m \times n}{p} + p - 1$$

           PHDPA                DPA Assembly
m x n      Total clock cycles   Total clock cycles   Speedup
50 x 50            628               282850          450.3981
100 x 100         2503              1130700          451.7379
150 x 150         5628              2543550          451.9456
200 x 200        10003              4521400          452.0044
250 x 250        15628              7064250          452.0252
300 x 300        22503             10172100          452.0331
350 x 350        30628             13844950          452.0357
400 x 400        40003             18082800          452.0361

Table 4.4 PHDPA speedup compared with the DPA assembly program.
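The numbers in Table 4.4 follow directly from the two formulas; a small verification sketch with p = 4, as in our design:

```c
#include <stdio.h>

int main(void) {
    const long p = 4;   /* number of dedicated processors in the PHDPA */
    for (long s = 50; s <= 400; s += 50) {
        long m = s, n = s;
        long dpa   = (113L * n + 7L) * m;      /* DPA assembly clock cycles */
        long phdpa = (m * n) / p + p - 1;      /* PHDPA clock cycles        */
        printf("%3ld x %3ld  PHDPA=%7ld  DPA=%9ld  speedup=%.4f\n",
               m, n, phdpa, dpa, (double)dpa / (double)phdpa);
    }
    return 0;
}
```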

Figure 4.13 illustrates Table 4.4: the x-axis is the m x n sequence size and the y-axis is the total clock cycles. The deep blue line is the DPA assembly program and the other line is our PHDPA.

Figure 4.13 Illustration of Table 4.4 (x-axis: m x n number of sequences; y-axis: total clock cycles).


Chapter 5

Conclusion and Future Works

5.1 Conclusion

In this dissertation, we (1) design an Object Relational DataBase Machine (ORDBM) with the ROPU and a PC cluster; the ROPU and the PC cluster can execute all of the relational algebra and aggregate function operators. (2) We develop a Cluster-based Parallel Search System (CPSS) and the m-1 way virtual search tree algorithm; compared with the m-way search tree algorithm, our m-1 way virtual search tree algorithm has better overall performance. (3) We design the Parallel Hardware of Dynamic Programming Alignment (PHDPA) to solve the time-consuming tabular computation; applied to a PC cluster holding the database, the PHDPA can align sequences against the database in parallel. (4) We design the Hardware of Genetic Algorithm (HGA) to speed up the solution of the TSP; the HGA possesses both the efficiency of hardware and the flexibility of software. Three parameters, the population size, the crossover rate and the mutation rate, can be adjusted through the software part of the HGA, and during the evolution process the hardware quickly converges the result toward the optimum.

5.2 Future Works

Continuing this work, we are exploring: (1) the interaction of the PCI interface card with the x86 CPUs on the PC cluster; (2) a PCI interface card driver library using PVM for Linux; (3) a user-friendly interface for executing database mining, the search algorithm, dynamic programming and the genetic algorithm; and (4) the design of more bioinformatic algorithms in hardware.


Reference

[1] James W. Lindelien, "The Value of Accelerated Computing for Bioinformatics", DeCypher Senior Design Engineer, CTO & Founder, TimeLogic Corporation.

[2] Shin-Hwa Chiou, "Simulate a VLSI chip for the main processor of a relational database machine", Master Dissertation, Electrical Engineering Graduate School, Chung-Hua Polytechnic Institute, June 1994.

[3] Yun-Shuh Su, "Develop the prototype of a parallel database computer", Master Dissertation, Electrical Engineering Graduate School, Chung-Hua Polytechnic Institute, June 1994.

[4] A. K. Sood and W. Shu, "Parallel processor implementation of relational database operation", Conf. on Vector and Parallel Processors in Computing Science II, Oxford, England, Aug. 1984.

[5] W. Shu, "Parallel Computer Architectures for Relational Databases", Ph.D. dissertation, Electrical and Computer Engineering, Wayne State University, Detroit, Michigan, May 1985.

[6] Ruey-Ming Shieh, "Develop Language translation system for a relational algebra machine", Master Dissertation, Electrical Engineering Graduate School, Chung-Hua Polytechnic Institute, June 1995.

[7] A. K. Sood and W. Shu, "A relational algebra machine for fifth generation computers", Conf. on Information Sciences & Systems, Baltimore, Maryland, March 1985.

[8] W. Shu, "Develop the Prototype of a Relational Algebra Machine for Fifth Generation Computer", 1997 Workshop on Distributed System Technologies & Applications, Cheng Kung Univ., Tainan, Taiwan, May 1997.

[9] Bayer, R. and McCreight, E., "Organization and maintenance of large ordered indexes", Acta Informatica 1, 3 (1972), 173-189.

[10] Douglas Comer, "The Ubiquitous B-Tree", ACM Computing Surveys (CSUR), v.11 n.2, p.121-137, June 1979.

[11] D. Lomet, "The Evolution of Effective B-tree: Page Organization and Techniques: A Personal Account", ACM SIGMOD Record, 30(3):64-69, Sep. 2001.

[12] Graefe, G. and Larson, P., "B-tree Indexes and CPU Caches", Intl. Conf. on Data Engineering, Heidelberg (Apr. 2001), 349-358.

[13] Shimin Chen, Phillip B. Gibbons, Todd C. Mowry and Gary Valentin, "Fractal prefetching B+-Trees: optimizing both cache and disk performance", Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, 2002.

[14] Peter Bumbulis and Ivan T. Bowman, "A compact B-tree", Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, 2002.

[15] Jun Rao and Kenneth A. Ross, "Making B+-trees cache conscious in main memory", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 475-486, 2000.

[16] Chang-Hwei Hsu, "Cluster Based m-1_way Search Algorithm", Master Dissertation, Graduate School of Computer Science & Information Engineering Dept., Chung-Hua University, June 2003.

[17] Henikoff, S. and Henikoff, J.G., "Amino acid substitution matrices from protein blocks", Proc. Natl. Acad. Sci. USA, 89, 10915-10919.

[18] Korostensky, C. and Gonnet, G. H., "Near Optimal Multiple Sequence Alignments Using a Traveling Salesman Problem Approach", SPIRE/CRIWG, pp. 105-114, 1999.

[19] Gen, M. and Cheng, R., "Genetic Algorithms and Engineering Design", John Wiley & Sons, Inc., Canada, pp. 2, 118-127, 1997.

[20] Tsujimura, Y. and Gen, M., "Entropy-based Genetic Algorithm for Solving TSP",
