Solving Combinatorial Optimization of Relational Algebra Operations Using Biological Computation


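National Science Council, Executive Yuan: Research Project Results Report

Research Results Report (Condensed Version)

Project type: Individual project
Project number: NSC 95-2221-E-151-034-
Project period: August 1, 2006 to July 31, 2007
Executing institution: Institute of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences
Principal investigator: 張雲龍
Project participants: part-time assistant (doctoral student): 施能裕; part-time assistants (undergraduate students): 陳柏樺, 陳佩玲, 董耀文, 邵鏡昇
Availability: this project report is open to public query

October 8, 2007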


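National Science Council, Executive Yuan: Subsidized Research Project Results Report

Project type: Individual project
Project number: NSC 95-2221-E-151-034-
Project period: 2006.08.01 to 2007.07.31
Principal investigator: 張雲龍
Co-principal investigator: 陳碧慧
Project participants: 施能裕, 陳柏樺, 陳佩玲, 董耀文, 邵鏡昇
Report type (submitted as required by the approved budget list): condensed report
Availability: immediately open to public query
Executing institution: Department of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences

October 1, 2007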


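(I) Project Abstract (Chinese Version)

Keywords: optimization of logical level, optimization of physical level, optimization of query, optimization of biological relational algebra operations, optimization of biological relational database, biological computation.

Codd published the first paper discussing a relational database model, and Adleman used biological operations to solve the Hamiltonian path problem (a famous combinatorial optimization problem). According to [2], the information storage density of biological media is one bit per cubic nanometer, whereas that of a magnetic tape is one bit per 10^12 cubic nanometers. This project demonstrates that biological operations can be used to construct a biological relational database in which every data record in a relational table can be encoded by biological operations. To this end, biological-operation algorithms are proposed to carry out the eight operations of a relational database: Cartesian product, union, set difference, selection, projection, intersection, join and division. Furthermore, this project uses biological operations to accomplish (1) optimization of logical level, (2) optimization of physical level, (3) optimization of query, (4) optimization of biological relational algebra operations, and (5) optimization of biological relational database, and demonstrates that biological computation is capable of performing data access operations on biological relational databases.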


(I) Project Abstract (English Version)

Keywords: Optimization of Logical Level, Optimization of Physical Level, Optimization of Query, Optimization of Biological Relational Algebra Operations, Optimization of Biological Relational Database, Biological Computation.

Codd wrote the first paper in which the model of a relational database was proposed. Adleman wrote the first paper in which biological operations were used to solve an instance of the Hamiltonian path problem (a famous combinatorial optimization problem). According to [2], storing information in biological molecules allows an information density of approximately 1 bit per cubic nanometer, a dramatic improvement over existing storage media such as videotape, which store information at a density of approximately 1 bit per 10^12 cubic nanometers. This project demonstrates that biological operations can be applied to construct biological databases in which data records in relational tables are encoded through biological operations. To achieve this goal, biological operational algorithms are proposed to perform eight operations of relational algebra (calculus) on biological relational databases: Cartesian product, union, set difference, selection, projection, intersection, join and division. Furthermore, this work uses biological operations to perform (1) Optimization of Logical Level, (2) Optimization of Physical Level, (3) Optimization of Query, (4) Optimization of Biological Relational Algebra Operations, and (5) Optimization of Biological Relational Database, and presents evidence of the ability of biological computation to perform data retrieval operations on biological relational databases.


Report Content

1. Introduction

In 1970, Codd [1] wrote the first paper regarding a relational database. Adleman [2] succeeded in solving an instance of the Hamiltonian path problem in a test tube by handling DNA strands in 1994. Since then, many DNA-based algorithms have been offered to solve NP-complete or NP-hard problems such as the set-splitting problem [3].

The paper is organized as follows. Section 2 introduces DNA manipulations. Section 3 introduces DNA-based algorithms to perform relational operations, which include Cartesian product, union, set difference, selection, projection, intersection, join and division. Finally, conclusions are made in Section 4.

2. Background

DNA (deoxyribonucleic acid) is the molecule that plays the main role in DNA-based computing [2]. Nucleotides are the structural units of DNA. In the most common nucleotides, the base is a derivative of purine or pyrimidine and the sugar is a five-carbon sugar. Purines include adenine and guanine, abbreviated A and G. Pyrimidines include cytosine and thymine, abbreviated C and T. This section introduces several available DNA techniques that are used to perform relational operations on bio-molecular relational databases.

2.1. DNA Manipulations

A (test) tube is a set of molecules of DNA (a multi-set of finite strings over the alphabet {A, C, G, T}) on which the following operations can be performed:

1. Extract. Given a tube P and a short single strand of DNA, S, the operation produces two tubes +(P, S) and −(P, S), where +(P, S) is all of the molecules of DNA in P which contain S as a sub-strand and −(P, S) is all of the molecules of DNA in P which do not contain S.

2. Merge. Given tubes P1 and P2, yield ∪(P1, P2), where ∪(P1, P2) = P1 ∪ P2. This operation pours two tubes into one, without any change in the individual strands.

3. Detect. Given a tube P, if P includes at least one DNA molecule we have ‘yes’, and if P contains no DNA molecule we have ‘no’.

4. Discard. Given a tube P, the operation will discard P.

5. Amplify. Given a tube P, the operation, Amplify(P, P1, P2), will produce two new identical copies, P1 and P2, of tube P, and then P becomes empty.

6. Append. Given a tube P and a short strand of DNA, Z, the operation will append Z onto the end of every strand in P.

7. Append-head. Given a tube P and a short strand of DNA, Z, the operation will append Z onto the head of every strand in P.

8. Read. Given a tube P, the operation is used to describe a single molecule, which is contained in tube P. Even if P contains many different molecules each encoding a different set of bases, the operation can give an explicit description of exactly one of them.
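The tube operations listed above can be mimicked in software to check the logic of a DNA algorithm before any laboratory work. The following is a minimal sketch, assuming a tube is modeled as a Python list of strings over {A, C, G, T}; the function names and the list-based model are illustrative assumptions and say nothing about the chemistry involved.

```python
from typing import List, Tuple

Tube = List[str]  # assumed model: a multi-set of strands over {A, C, G, T}


def extract(p: Tube, s: str) -> Tuple[Tube, Tube]:
    """Return (+(P, S), -(P, S)): strands containing s, and the remaining strands."""
    plus = [x for x in p if s in x]
    minus = [x for x in p if s not in x]
    return plus, minus


def merge(p1: Tube, p2: Tube) -> Tube:
    """Pour two tubes into one without changing the individual strands."""
    return p1 + p2


def detect(p: Tube) -> str:
    """Report 'yes' if the tube holds at least one strand, otherwise 'no'."""
    return "yes" if p else "no"


def amplify(p: Tube) -> Tuple[Tube, Tube]:
    """Produce two identical copies of p; p itself becomes empty."""
    p1, p2 = list(p), list(p)
    p.clear()
    return p1, p2


def append(p: Tube, z: str) -> Tube:
    """Append z onto the end of every strand."""
    return [x + z for x in p]


def append_head(p: Tube, z: str) -> Tube:
    """Append z onto the head of every strand."""
    return [z + x for x in p]


def read(p: Tube) -> str:
    """Describe exactly one strand of a non-empty tube (here simply the first)."""
    return p[0]
```

Discard corresponds to emptying the list (for example `p.clear()`), so it is not given its own function in this sketch.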


3. Constructing Bio-Molecular Relational Databases

3.1. Cartesian Product on Bio-molecular Databases

An n-ary relation R is a subset of the Cartesian product S_1 × S_2 × … × S_n. Assume that R is equal to {(r_{i,1}, …, r_{i,n}) | r_{i,k} ∈ S_k for 1 ≤ k ≤ n and 1 ≤ i ≤ m}. The value r_{i,k} in R can be encoded as a binary number v_{i,k,1} … v_{i,k,L_k}, where L_k is the number of bits used for the k-th domain, 1 ≤ k ≤ n and 1 ≤ i ≤ m. For each i, CartesianProduct(T0, m) calls Insert(T80, i), which uses append operations to encode the i-th record of R in tube T80, and then merges T80 into T0.

Procedure Insert(T80, i)
(1) For k = 1 to n
(2)    For j = 1 to L_k
          Append(T80, v_{i,k,j})
       EndFor
    EndFor
EndProcedure

Procedure CartesianProduct(T0, m)
(1) For i = 1 to m
       Insert(T80, i); T0 = ∪(T0, T80)
    EndFor
EndProcedure
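The two procedures above can be traced with a small simulation. This is a minimal sketch under stated assumptions: each bit v_{i,k,j} is represented by a symbolic label produced by a hypothetical helper `bit_label` (a stand-in for a unique DNA sequence), appending to an empty tube is assumed to start a new strand, and T80 is assumed to be empty again after it has been poured into T0.

```python
from typing import List, Sequence

Tube = List[str]


def bit_label(k: int, j: int, bit: int) -> str:
    """Hypothetical stand-in for the DNA sequence encoding bit j of field k."""
    return f"<{k}.{j}:{bit}>"


def append(tube: Tube, z: str) -> None:
    """Append z to every strand; an empty tube is assumed to start a new strand."""
    if not tube:
        tube.append(z)
    else:
        tube[:] = [s + z for s in tube]


def insert(t80: Tube, record: Sequence[Sequence[int]]) -> None:
    """Encode one record: for each field k, append its bits j = 1..L_k in order."""
    for k, field in enumerate(record, start=1):
        for j, bit in enumerate(field, start=1):
            append(t80, bit_label(k, j, bit))


def cartesian_product(t0: Tube, relation: Sequence[Sequence[Sequence[int]]]) -> None:
    """Encode each of the m records in turn and merge it into t0."""
    for record in relation:
        t80: Tube = []      # assumed empty at the start of every iteration
        insert(t80, record)
        t0.extend(t80)      # T0 = ∪(T0, T80)


# Two records with two fields each: field 1 uses 2 bits, field 2 uses 1 bit.
relation = [[(0, 1), (1,)], [(1, 1), (0,)]]
t0: Tube = []
cartesian_product(t0, relation)
```

After the call, `t0` holds one strand per record, each strand being the concatenation of that record's bit labels.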

3.2. Set Operations on Bio-molecular Databases

Assume that X and Y are n-ary relations with p and q elements respectively. Tube T1 has p DNA sequences representing the p records in X and tube T2 has q DNA sequences representing the q records in Y. The following three DNA algorithms implement the union, intersection and set difference operations on X and Y.

Procedure Union(T1, T2, T3, p)
(1) Amplify(T1, T11, T12); Amplify(T2, T21, T22).
(2) T1 = ∪(T1, T11); T2 = ∪(T2, T21).
(3) For i = 1 to p
(4)    For k = 1 to n
(5)       For j = 1 to L_k
(5a)         T22 = +(T22, v_{i,k,j}) and T22OFF = −(T22, v_{i,k,j}).
(5b)         T22ON = ∪(T22ON, T22OFF).
          EndFor
       EndFor
(5c)   Discard(T22); T22 = ∪(T22, T22ON).
    EndFor
(6) T3 = ∪(T12, T22).
EndProcedure

Procedure Intersection(T1, T2, T4, p)
(1) Amplify(T2, T21, T22); T2 = ∪(T2, T21).
(2) For i = 1 to p
(3)    For k = 1 to n
(4)       For j = 1 to L_k
(4a)         T22 = +(T22, v_{i,k,j}) and T22OFF = −(T22, v_{i,k,j}).
(4b)         T22ON = ∪(T22ON, T22OFF).
          EndFor
       EndFor
       T4 = ∪(T4, T22); T22 = ∪(T22, T22ON).
    EndFor
(5) Discard(T22).
EndProcedure

Procedure Difference(T1, T2, T5, q)
(1) Amplify(T1, T11, T12); Amplify(T2, T21, T22).
(2) T1 = ∪(T1, T11); T2 = ∪(T2, T21).
(3) For i = 1 to q
(4)    For k = 1 to n
(5)       For j = 1 to L_k
(5a)         T12 = +(T12, v_{i,k,j}) and T12OFF = −(T12, v_{i,k,j}).
(5b)         T12ON = ∪(T12ON, T12OFF).
(5c)         T22 = +(T22, v_{i,k,j}) and T22OFF = −(T22, v_{i,k,j}).
(5d)         T22ON = ∪(T22ON, T22OFF).
          EndFor
       EndFor
(3a)   If (Detect(T22) = ‘yes’) then Discard(T12) EndIf
(3c)   T12 = ∪(T12, T12ON).
(3d)   T22 = ∪(T22, T22ON).
    EndFor
(6) T5 = ∪(T5, T12).
EndProcedure
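The common pattern in the three procedures is a chain of extract steps that isolates the strands equal to one known record, followed by a merge or a discard. The following is a simplified sketch of that pattern, assuming strands are concatenations of symbolic bit labels such as `<1.1:0>` (a stand-in for DNA sequences); the helper `split_on_record` and the function signatures are illustrative, and the Detect step of Difference is omitted because the sketch filters T12 directly.

```python
from typing import List, Tuple

Tube = List[str]
Record = List[str]  # assumed: the bit labels of one record, in order


def extract(tube: Tube, s: str) -> Tuple[Tube, Tube]:
    """Return (+(P, S), -(P, S))."""
    return [x for x in tube if s in x], [x for x in tube if s not in x]


def split_on_record(tube: Tube, labels: Record) -> Tuple[Tube, Tube]:
    """Separate strands carrying every bit label of one record from the rest."""
    on, off = list(tube), []
    for s in labels:
        on, rejected = extract(on, s)
        off += rejected
    return on, off


def union(x_records: List[Record], t1: Tube, t2: Tube) -> Tube:
    """Keep all of X, plus the strands of Y that duplicate no record of X."""
    t22 = list(t2)
    for labels in x_records:
        _, t22 = split_on_record(t22, labels)  # matching strands are discarded
    return list(t1) + t22


def intersection(x_records: List[Record], t2: Tube) -> Tube:
    """Collect the strands of Y that match some record of X."""
    t4, t22 = [], list(t2)
    for labels in x_records:
        match, t22 = split_on_record(t22, labels)
        t4 += match
    return t4


def difference(y_records: List[Record], t1: Tube) -> Tube:
    """Remove from X every strand that matches some record of Y."""
    t12 = list(t1)
    for labels in y_records:
        _, t12 = split_on_record(t12, labels)
    return t12


X = [["<1.1:0>", "<1.2:1>"], ["<1.1:1>", "<1.2:1>"]]
Y = [["<1.1:1>", "<1.2:1>"], ["<1.1:0>", "<1.2:0>"]]
T1 = ["".join(r) for r in X]
T2 = ["".join(r) for r in Y]
```

With these sample relations, X and Y share exactly one record, so the union holds three strands, the intersection one, and X minus Y one.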

3.3. Projection Operator on Bio-molecular Databases

Algorithm JudgeDistinctElement uses the amplify, merge, extract and detect operations to eliminate duplicates produced by the projection operator on any n-ary relation.

Procedure JudgeDistinctElement(T6, T9, T6>, T6=, T6<)
(1) Amplify(T6, T6ON, T6OFF); T6 = ∪(T6, T6ON).
(2) For d = 1 to c
(3)    For j = 1 to L_k
(3a)      T9ON = +(T9, v_{i,k,j}^1) and T9OFF = −(T9, v_{i,k,j}^1), where the specific dth column for R corresponds to the kth domain.
(3b)      T7ON = +(T6OFF, v_{i,k,j}^1) and T7OFF = −(T6OFF, v_{i,k,j}^1).
(3c)      If (Detect(T9ON) = ‘yes’) then
             T6= = ∪(T6=, T7ON); T6< = ∪(T6<, T7OFF); T9 = ∪(T9, T9ON).
          Else
             T6> = ∪(T6>, T7ON); T6= = ∪(T6=, T7OFF); T9 = ∪(T9, T9OFF).
          EndIf
(3d)      T6OFF = ∪(T6OFF, T9=).
       EndFor
    EndFor
EndProcedure

Procedure Projection(T0, T6, m, c)
(1) Amplify(T0, T7, T8); T0 = ∪(T0, T7).
(2) For i = 1 to m
(3)    For d = 1 to c
(4)       For j = 1 to L_k
(4a)         T8 = +(T8, v_{i,k,j}) and T8OFF = −(T8, v_{i,k,j}), where the specific dth column for R corresponds to the kth domain.
(4b)         T8ON = ∪(T8ON, T8OFF).
          EndFor
(5)       If (Detect(T8) = ‘yes’) then
(6)          For j = 1 to L_k
(6a)            Append(T9, v_{i,k,j}), where the specific dth column for R corresponds to the kth domain.
             EndFor
          EndIf
(7)       T8 = ∪(T8, T8ON).
       EndFor
(8)    If (Detect(T9) = ‘yes’) then
(9)       If (Detect(T6) = ‘yes’) then
(10)         JudgeDistinctElement(T6, T9, T6>, T6=, T6<).
(11)         If (Detect(T6=) = ‘no’) then
(12)            T6 = ∪(T6, T9).
             Else
(13)            Discard(T9).
             EndIf
          Else
(14)         T6 = ∪(T6, T9).
          EndIf
       EndIf
    EndFor
EndProcedure
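The effect of Projection, namely building a projected strand for each record and keeping it only if no equal strand is already present, can be illustrated in a few lines. This is a simplified sketch, assuming strands are concatenations of symbolic bit labels and that each record is given as a list of fields; the bitwise greater/equal/less sorting of JudgeDistinctElement is replaced here by a plain equality check through extract, which is a substitution made for brevity.

```python
from typing import List, Sequence, Tuple

Tube = List[str]
Fields = List[List[str]]  # assumed: one list of bit labels per column


def extract(tube: Tube, s: str) -> Tuple[Tube, Tube]:
    """Return (+(P, S), -(P, S))."""
    return [x for x in tube if s in x], [x for x in tube if s not in x]


def matching(tube: Tube, labels: Sequence[str]) -> Tube:
    """Strands that carry every one of the given bit labels."""
    on = list(tube)
    for s in labels:
        on, _ = extract(on, s)
    return on


def projection(records: List[Fields], t0: Tube, columns: Sequence[int]) -> Tube:
    """Project the records present in t0 onto the chosen columns, without duplicates."""
    t6: Tube = []
    for fields in records:
        all_labels = [s for f in fields for s in f]
        if not matching(t0, all_labels):
            continue                       # record i is not present in the tube
        kept = [s for d in columns for s in fields[d]]
        if not matching(t6, kept):         # Detect: no equal projected strand yet
            t6.append("".join(kept))       # T6 = ∪(T6, T9)
    return t6


records = [
    [["<1.1:0>"], ["<2.1:1>"]],
    [["<1.1:0>"], ["<2.1:0>"]],
    [["<1.1:1>"], ["<2.1:1>"]],
]
t0 = ["".join(s for f in fields for s in f) for fields in records]
projected = projection(records, t0, [0])
```

Here the first two records agree on column 0, so `projected` keeps one strand for them and one for the third record.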

3.4. Selection Operator on Bio-molecular Databases

Algorithm Selection uses the amplify, merge, extract, append and detect operations to perform the selection operator on any n-ary relation.

Procedure Selection(T0, T16>, T16=, T16<, T16≠, T16≥, T16≤, D, E)
(1) Amplify(T0, T13, T14); T0 = ∪(T0, T13).
(2) For j = 1 to L_k; Append(T15, e_j); EndFor
(3) For j = 1 to L_k
(3a)   T15ON = +(T15, e_j^1) and T15OFF = −(T15, e_j^1).
(3b)   T14ON = +(T14, v_{i,k,j}^1) and T14OFF = −(T14, v_{i,k,j}^1).
(3c)   If (Detect(T15ON) = ‘yes’) then
          T9= = ∪(T9=, T14ON); T9< = ∪(T9<, T14OFF); T15 = ∪(T15, T15ON).
       Else
          T9> = ∪(T9>, T14ON); T9= = ∪(T9=, T14OFF); T15 = ∪(T15, T15OFF).
       EndIf
(3d)   T14 = ∪(T14, T9=).
    EndFor
(4) T9= = ∪(T9=, T14).
(5) Amplify(T9>, T16>, T17>); Amplify(T9=, T16=, T17=); Amplify(T9<, T16<, T17<).
(6) Amplify(T17>, T18>, T19>); Amplify(T17=, T18=, T19=); Amplify(T17<, T18<, T19<).
(7) T16≥ = ∪(T18>, T18=); T16≤ = ∪(T18<, T19=); T16≠ = ∪(T19>, T19<).
EndProcedure
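The core of Selection is a bit-by-bit comparison, most significant bit first, between one field of every strand and a constant. The following is a minimal sketch of that comparison, assuming symbolic bit labels produced by a hypothetical `label` helper; the dictionary keys ">", "=", "<", ">=", "<=" and "!=" stand in for the six output tubes and are an illustrative choice.

```python
from typing import Dict, List, Sequence

Tube = List[str]


def label(k: int, j: int, bit: int) -> str:
    """Hypothetical stand-in for the DNA sequence encoding bit j of field k."""
    return f"<{k}.{j}:{bit}>"


def extract(tube: Tube, s: str) -> Tube:
    """Return +(P, S): the strands that contain s."""
    return [x for x in tube if s in x]


def selection(tube: Tube, k: int, const_bits: Sequence[int]) -> Dict[str, Tube]:
    """Sort strands by comparing field k with a constant, one bit at a time."""
    gt: Tube = []
    lt: Tube = []
    eq: Tube = list(tube)                  # strands still equal on all bits seen so far
    for j, e in enumerate(const_bits, start=1):
        ones = extract(eq, label(k, j, 1))
        zeros = extract(eq, label(k, j, 0))
        if e == 1:
            lt += zeros                    # strand bit 0 < constant bit 1
            eq = ones
        else:
            gt += ones                     # strand bit 1 > constant bit 0
            eq = zeros
    return {">": gt, "=": eq, "<": lt,
            ">=": gt + eq, "<=": lt + eq, "!=": gt + lt}


# Four strands encoding the 2-bit values 0..3 in field 1, compared with binary 10.
strands = ["".join(label(1, j, b) for j, b in enumerate(bits, start=1))
           for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]]
result = selection(strands, 1, (1, 0))
```

Each strand leaves the "still equal" tube at the first bit where it differs from the constant, which is why a single pass over the L_k bits suffices.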

3.5. Theta-Join Operator on Bio-molecular Databases

Algorithm CartesianProductTwoRelations uses the extract, merge and append operations on any two n-ary relations, R1 and R2, to compute R1 × R2 through tubes T51 and T52. The theta-join operator on R1 and R2 is then performed through the Cartesian product followed by the selection operator.

Procedure CartesianProductTwoRelations(T51, T52)
(1) For i = 1 to q
(2)    For k = 1 to n
(3)       For j = 1 to L_k
(3a)         T52ON = +(T52, v_{i,k,j}^1) and T52OFF = −(T52, v_{i,k,j}^1).
(3b)         If (Detect(T52ON) = ‘yes’) then Append(T51, v_{i,k,j}^1) EndIf
(3c)         If (Detect(T52OFF) = ‘yes’) then Append(T51, v_{i,k,j}^0) EndIf
(3d)         T52 = ∪(T52ON, T52OFF).
          EndFor
       EndFor
    EndFor
EndProcedure

Procedure Theta-join(T50)
(1) CartesianProduct(T53, p); CartesianProduct(T54, q).
(2) Amplify(T53, T51, T55); Amplify(T54, T52, T56).
(3) T53 = ∪(T53, T55); T54 = ∪(T54, T56).
(4) CartesianProductTwoRelations(T51, T52).
(5) Selection(T51, T51>, T51=, T51<, T51≠, T51≥, T51≤, D, E); T50 = ∪(T50, T51).
EndProcedure
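The relational meaning of a theta-join, a Cartesian product followed by a selection, can be stated compactly on ordinary tuples. The sketch below illustrates that semantics only and is not a simulation of the tube procedures; the function name `theta_join`, its parameters and the sample relations are assumptions made for the example.

```python
import operator
from itertools import product
from typing import Any, Callable, List, Sequence, Tuple

Row = Tuple[Any, ...]


def theta_join(r1: Sequence[Row], r2: Sequence[Row], a: int, b: int,
               theta: Callable[[Any, Any], bool]) -> List[Row]:
    """Pair every row of r1 with every row of r2, keeping pairs where
    theta(row1[a], row2[b]) holds."""
    return [t1 + t2 for t1, t2 in product(r1, r2) if theta(t1[a], t2[b])]


r1 = [(1, "x"), (3, "y")]
r2 = [(2,), (3,)]
joined = theta_join(r1, r2, 0, 0, operator.lt)  # condition: r1 column 0 < r2 column 0
```

Passing `operator.eq` instead of `operator.lt` turns the same function into an equi-join.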

3.6. Division Operator on Bio-molecular Databases

The division operator on any two n-ary relations, R1 and R2, is computed using the projection and difference operators together with the Cartesian product.
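The identity behind this construction can be checked on ordinary sets: the quotient is the projection of R1 minus the projection of those candidate pairs that are missing from R1. The following is a minimal sketch, assuming R1 is a set of (a, b) pairs and R2 a set of b values; the function name `division` and the sample data are illustrative.

```python
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def division(r1: Set[Pair], r2: Set[Hashable]) -> Set[Hashable]:
    """Return every a such that (a, b) is in r1 for all b in r2."""
    candidates = {a for a, _ in r1}                         # projection of R1
    required = {(a, b) for a in candidates for b in r2}     # Cartesian product with R2
    missing = required - r1                                 # difference
    disqualified = {a for a, _ in missing}                  # projection of the difference
    return candidates - disqualified                        # final difference


supplies = {("s1", "p1"), ("s1", "p2"), ("s2", "p1")}
parts = {"p1", "p2"}
quotient = division(supplies, parts)
```

In the sample data only s1 is paired with every part, so s2 is removed by the final difference.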

Procedure Division(T60)
(1) CartesianProduct(T63, p); CartesianProduct(T64, q).
(2) Amplify(T63, T67, T65); Amplify(T64, T68, T66).
(3) T63 = ∪(T63, T65); T64 = ∪(T64, T66).
(4) Projection(T67, T61, p, w); CartesianProductTwoRelations(T61, T66).
(5) Difference(T61, T67, T69, p); Projection(T69, T70, p * q, w).
(6) Projection(T67, T71, p, w); Difference(T71, T70, T60, p * q).
EndProcedure

4. Conclusions

The success of a biological operation relies on the assumption that no accidental bonds form between molecules in the tube before the operation is initiated, or during the operation. This paper demonstrates that eight fundamental relational algebra operations (Cartesian product, union, set difference, selection, projection, intersection, join and division) can be performed on a bio-molecular database. This suggests that bio-molecular databases may help meet the rapidly growing demand for information-processing capacity. It is possible that in the future molecular computers will be the clear choice for performing massively parallel computations and storing very large amounts of information.

References

1. E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, Volume 13, Number 6, pp. 377-387 (1970).

2. L. Adleman, "Molecular Computation of Solutions to Combinatorial Problems", Science, Volume 266, pp. 1021-1024 (1994).

3. Weng-Long Chang, Minyi Guo and Michael Ho, "Towards Solution of the Set-Splitting Problem on Gel-Based DNA Computing", Future Generation Computer Systems, Volume 20, Issue 5, pp. 875-885 (2004).


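Self-Evaluation of Project Results

The content of this project has been published as one paper, titled "CONSTRUCTING BIO-MOLECULAR DATABASES ON A DNA-BASED COMPUTER", at the 8th International Conference on Natural Computing (NC), part of JCIS 2007.

Using basic biological operations, this project has implemented eight functions on bio-molecular relational databases: (1) Cartesian product, (2) union, (3) intersection, (4) set difference, (5) selection, (6) projection, (7) join and (8) division. In the future, nanometer-scale relational database systems will be another highly important option among mass information storage media.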
