Conclusions and Future Work

Parallel and distributed computing is an effective strategy for solving large-scale, computation-intensive problems. Because idle computing units lengthen the overall computation time, load balancing is key to the design of parallel and distributed algorithms. In this dissertation, we used parallel strategies to solve three important problems in bioinformatics, cheminformatics, and data mining: the minimum ultrametric tree (MUT) construction problem, the chemical compound inference (CCI) problem, and the frequent pattern mining (FPM) problem, respectively. We also developed a different load-balancing facility for each problem.

For the MUT problem, we designed and implemented a parallel branch-and-bound algorithm, named PBBU, on a master/slave computing system. Two pools, the Global Pool and the Local Pool, were designed for PBBU to balance the workload among computing nodes. Because the Global Pool is centralized, PBBU can run on heterogeneous computing systems such as a Grid system, and the Local Pool mechanism reduces the communication between the slave nodes and the master node. We used randomly generated distance matrices, Human+Chimpanzee Mitochondrial DNAs, and Bacteriophage T7 DNAs to verify the performance and correctness of PBBU. The results for Bacteriophage T7 show that the branching orders are correct. As for performance, the experimental results show that PBBU found an optimal solution for 36 species on 16 PCs within a reasonable time. Moreover, PBBU achieved satisfactory speed-up ratios for most of the test cases.
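The interplay of the two pools can be illustrated with a minimal single-process sketch. It is not the PBBU implementation: it assumes a toy assignment problem in place of ultrametric tree construction, simulates the slave nodes in round-robin order rather than over a network, and all names and parameters (`two_pool_branch_and_bound`, `batch`, `local_limit`) are hypothetical.

```python
from collections import deque


def lower_bound(cost, cols, partial_cost):
    """Cost so far plus the cheapest entry of every unassigned row.

    Column conflicts are ignored, so this never overestimates the true cost.
    """
    return partial_cost + sum(min(cost[r]) for r in range(len(cols), len(cost)))


def two_pool_branch_and_bound(cost, workers=3, batch=2, local_limit=4):
    """Toy assignment problem: pick one distinct column per row, minimizing total cost."""
    n = len(cost)
    global_pool = deque([((), 0)])              # master-side pool of open nodes
    local_pools = [[] for _ in range(workers)]  # one private pool per simulated worker
    best_cost, best_cols = float("inf"), None

    while global_pool or any(local_pools):
        for local in local_pools:               # round-robin stands in for parallel workers
            if not local:
                # An idle worker asks the global pool for a small batch of nodes.
                for _ in range(min(batch, len(global_pool))):
                    local.append(global_pool.popleft())
            if not local:
                continue
            cols, partial = local.pop()         # depth-first inside the local pool
            if lower_bound(cost, cols, partial) >= best_cost:
                continue                        # bound: this subtree cannot improve the best
            if len(cols) == n:
                best_cost, best_cols = partial, cols
                continue
            row = len(cols)
            for c in range(n):                  # branch: try every unused column
                if c not in cols:
                    local.append((cols + (c,), partial + cost[row][c]))
            # An overfull local pool hands its oldest nodes back for other workers.
            while len(local) > local_limit:
                global_pool.append(local.pop(0))
    return best_cost, best_cols


example_cost = [
    [9, 2, 7, 8],
    [6, 4, 3, 7],
    [5, 8, 1, 8],
    [7, 6, 9, 4],
]
best_cost, best_cols = two_pool_branch_and_bound(example_cost)
print(best_cost, best_cols)
```

Because a worker touches the global pool only when its local pool is empty or overfull, most node expansions need no exchange with the master, which is the kind of communication saving attributed to the Local Pool above.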

For the CCI problem, we designed and implemented a multi-process branch-and-bound algorithm, named BMPBB-CCI, on a multi-core computing system. To exchange different types of data structures efficiently among computing units, we designed a socket-based manager process. The manager process can hold various types of data structures and can also be extended easily to distributed-memory computing systems. The Kyoto Encyclopedia of Genes and Genomes (KEGG) Compound database was used to validate the performance of BMPBB-CCI. The experimental results showed that the computation time decreased as more processes were launched. Moreover, the proposed algorithm achieved satisfactory speed-up ratios for most of the test cases. In addition, we reconstructed the NA inhibitors of influenza A virus and used the pharmacophore model to calculate the estimated IC50. The results showed that the inferred compound may be a candidate NA inhibitor for influenza A virus.
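The speed-up ratios mentioned here can be computed as in the following sketch, which assumes the usual definitions: speed-up S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p. The timings are made-up placeholders for illustration, not measurements from the BMPBB-CCI experiments, and the function name is hypothetical.

```python
def speedup_table(times_by_processes):
    """Map each process count p to (speed-up, efficiency), given run times in seconds.

    The run time for p = 1 serves as the sequential baseline T(1).
    """
    t1 = times_by_processes[1]
    table = {}
    for p, t in sorted(times_by_processes.items()):
        speedup = t1 / t
        table[p] = (speedup, speedup / p)
    return table


# Hypothetical timings (seconds) for 1, 2, 4 and 8 processes.
illustrative_times = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}
table = speedup_table(illustrative_times)
for p, (speedup, efficiency) in table.items():
    print(f"p={p}: speed-up={speedup:.2f}, efficiency={efficiency:.2f}")
```

An efficiency that stays close to 1 as p grows indicates that the added processes are rarely idle, which is what a load-balancing mechanism aims for.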

For the FPM problem, we designed and implemented two parallel algorithms: Tidset-based Parallel FP-tree (TPFP-tree) for Cluster systems and Balanced Tidset-based Parallel FP-tree (BTP-tree) for Grid systems. The transaction identification set (Tidset) was used to speed up the exchange of transactions by selecting transactions directly instead of scanning the database. Because a Grid system is a heterogeneous computing system, a performance index was designed in BTP-tree to measure computing capabilities for a given dataset. We used data generated by the IBM data generator to verify the performance of the proposed frequent pattern mining algorithms. The experimental results showed that TPFP-tree reduces the execution time relative to PFP-tree on a PC cluster as the database size grows. Moreover, BTP-tree shortens the execution time significantly and has better load-balancing capability than TPFP-tree and PFP-tree on a multi-cluster grid.
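The idea of selecting transactions through a Tidset instead of scanning the database can be sketched as follows. This is only an illustration of the lookup principle under simplifying assumptions (an in-memory list of transactions, transaction IDs equal to list positions); it is not the TPFP-tree exchange procedure, and all function names are hypothetical.

```python
from collections import defaultdict


def build_tidsets(transactions):
    """One pass over the database: map each item to the IDs of transactions containing it."""
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return tidsets


def select_transactions(transactions, tidsets, item):
    """Fetch the transactions containing `item` by ID, without rescanning the database."""
    return [transactions[tid] for tid in sorted(tidsets.get(item, ()))]


def support(tidsets, itemset):
    """Support count of an itemset = size of the intersection of its items' Tidsets."""
    sets = [tidsets.get(item, set()) for item in itemset]
    return len(set.intersection(*sets)) if sets else 0


# Hypothetical toy database of four transactions.
transactions = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"c"}]
tidsets = build_tidsets(transactions)
print(sorted(tidsets["a"]))                       # IDs of transactions containing "a"
print(select_transactions(transactions, tidsets, "a"))
print(support(tidsets, {"a", "b"}))
```

After the single pass that builds the Tidsets, each selection costs time proportional to the number of matching transactions rather than to the size of the whole database.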

The characteristics of each proposed solution are summarized in Table 6-1.

Table 6-1: Characteristics of MUT, CCI, and FPM

                      MUT                             CCI               FPM
Parallel strategy
  Cluster             ✓                               -                 ✓
  Grid                ✓                               -                 ✓
  Multi-core          -                               ✓                 -
Method strategy
  Branch and bound    ✓                               ✓                 -
  Load balancing      1. Global Pool                  1. Global Queue   Performance Index
                      2. Local Pool                   2. Local Queue
Programming model
  MPI                 ✓                               -                 ✓
  Multi-process       -                               ✓                 -
Verification data     1. Random                       1. KEGG           IBM synthetic data
                      2. 135 Human + one Chimpanzee   2. NA inhibitor   generator
                         Mitochondrial DNAs
Speed-up              ✓                               ✓                 ✓
Correctness           ✓                               ✓                 ✓

In the future, we plan to work on the following directions:

For the MUT problem:

(1) Improving the performance of PBBU by adding strategies that decrease the communication cost or prune unnecessary branching nodes.

(2) Analyzing the relationship between the execution time of PBBU and the input distance matrices, which can help in choosing suitable strategies for various test cases.

For the CCI problem:

(1) Extending the BMPBB-CCI algorithm to Cluster and Grid computing systems.

(2) Improving the performance of BMPBB-CCI by adding strategies that reduce the tree search space.

For the FPM problem:

(1) Improving the accuracy of the performance index to better balance the workload among computing units.

(2) Analyzing the relationship between the execution time and the input dataset.
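One way a performance index could drive workload balancing is sketched below. How BTP-tree computes or applies its index is not described in this chapter, so the sketch simply assumes that each node already has a positive index and splits the work items in proportion to it; the function name and the sample indices are hypothetical.

```python
def split_by_performance(num_items, performance_index):
    """Split `num_items` work items across nodes in proportion to their performance index.

    Uses the largest-remainder method so that the shares are integers
    and always sum to `num_items`.
    """
    total = sum(performance_index)
    exact = [num_items * p / total for p in performance_index]
    shares = [int(x) for x in exact]            # round every share down first
    leftover = num_items - sum(shares)
    # Hand the remaining items to the nodes with the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical indices: the third node is assumed to be four times as fast as the first.
shares = split_by_performance(100, [1.0, 2.0, 4.0])
print(shares)
```

A more accurate index would make these shares track the real per-node execution times more closely, which is the motivation for the first FPM item above.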
