In this thesis, we analyzed the state-of-the-art graph mining algorithm gSpan and
addressed its possible inefficiencies. We found that the performance of gSpan degrades
on large and complex graphs (denser graphs with fewer available labels), and
especially on unlabeled graphs. Based on gSpan, we proposed a new graph enumeration
method that reduces the number of candidates generated.
Several research issues remain open for our proposed algorithm. First, we have yet to
prove its completeness. Second, there may be a better algorithm for calculating the
minimum code when the graph is unlabeled. Third, extending our algorithm to mine
labeled graphs may require a new lexicographic order; developing such an order for
our proposed algorithm is a research issue for our future work.