A Simple Tree Pattern-Matching Algorithm


Hsiao-Tzu Lu, Wuu Yang
Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan, ROC. E-mail: [email protected].

Abstract. Tree pattern matching occurs as a crucial step in a number of programming tasks. We propose a new algorithm to solve the tree pattern-matching problem. The algorithm may be viewed as an extension of the Knuth-Morris-Pratt string-matching algorithm to the tree pattern-matching problem. In the new algorithm, the number of times each node in the subject tree needs to be traversed is bounded by its level. Therefore, the time complexity of the simple tree pattern-matching is bounded by $O(n \log n)$, where $n$ is the number of nodes in the subject tree. The worst case occurs when many nodes of the subject tree have the same content as the pattern's root. The algorithm also needs extra time to preprocess the pattern. The time complexity of the pattern preprocessing is bounded by $O(m \log m)$, where $m$ is the number of nodes in the pattern. The worst case occurs similarly when many nodes of the pattern have the same content as its root. By using indirection, the space complexity is reduced to $O(n + m)$.

I. Introduction

Tree pattern matching is an interesting special problem which occurs as a crucial step in a number of programming tasks. It occurs frequently in the context of tree replacement systems and has applications in different areas of computer science, including automatic implementation of abstract data types, code optimization, automatic proof systems, syntax-directed compilation and evaluators for programming languages such as LISP [3] [4]. The tree replacement approach is also very convenient for producing interpreters for experimental languages.

Tree pattern matching is an extension of the problem of pattern matching in strings. We extend the Knuth-Morris-Pratt string-matching algorithm to tree patterns. Similarly, we need to preprocess the pattern first in order to speed up the matching and reduce the number of times that nodes need to be compared. In our method, the sibling nodes in the tree are ordered (i.e. the subject tree and the pattern are ordered trees). Take Fig. 1 for example. We cannot find a full match of the pattern of Fig. 1(a) in the subject tree of Fig. 1(b), though the pattern in Fig. 1(a) and the subject tree in Fig. 1(b) are similar.

Fig. 1. (a) The pattern. (b) The subject tree.

II. Existing Approaches to Tree Pattern Matching

Hoffmann and O'Donnell [3] proposed several algorithms to solve the tree pattern-matching problem. Their algorithms may be classified into two categories: one is a bottom-up approach, and the other is a top-down approach. While the bottom-up method generalizes string matching, the top-down method reduces tree matching to a string-matching problem.

The key idea of the bottom-up matching algorithms in [3] is to find, at each point in the subject tree, all patterns and all parts of patterns which match at this point. When the pattern forest F is simple (a simple pattern forest is a set of trees that contains no independent subtrees), Hoffmann and O'Donnell construct the subsumption graph for it and set the tables via the subsumption graph. Fig. 2 shows an example of the subsumption graph.

Fig. 2. An example of the immediate subsumption graph (nodes: a(a(v,v),b), a(b,v), a(v,v), b, v).

The top-down matching algorithm uses a matching automaton; Fig. 3 shows an example. In Fig. 3, the automaton is associated with the pattern of Fig. 3(a). Accepting states are circled twice and are labeled with the length of the accepted path string [3]. The top-down matching algorithm is slower than the bottom-up matching algorithms, but has shorter preprocessing time.

Fig. 3. (a) A pattern. (b) The associated matching automaton.

Rational patterns are used to specify recognizable tree languages. Simon [4] proposed an efficient tree pattern matching algorithm and applied it on nets. The algorithm solves the tree pattern matching in $O(|p|\,|t|)$ steps where there is some match of rational pattern $p$ in $t$.

III. Our Approach

A. Extension of the KMP String-Matching Algorithm

The basic concepts of the KMP string-matching algorithm are discussed in [5] and [6]. The auxiliary table next in the KMP string-matching algorithm may be shown pictorially: Fig. 4 is another representation of TABLE I, in which the arcs represent the next pointers. One of the basic concepts in the KMP string-matching algorithm is precomputation of the shifts. In the tree pattern-matching problem, we can likewise preprocess the pattern and precompute the shifts, which are called back in our algorithm. The difference is that we shift the pattern not only horizontally (right or left) but also vertically (up or down). We show the complete algorithm in the following sections.

Fig. 5. A sample tree.

Using LEMMA 2, we can easily determine where the parent, the left child, and the right child of any node are in the binary tree [2].

LEMMA 1 [The maximum number of nodes in a binary tree] [1] [2]
(1) The maximum number of nodes on level $i$ of a binary tree is $2^{i-1}$, $i \ge 1$.
(2) The maximum number of nodes in a binary tree of height $k$ is $2^{k} - 1$, $k \ge 1$.

LEMMA 2 If a complete binary tree with $n$ nodes is represented sequentially, then for any node with index $i$, $1 \le i \le n$, we have [2]
(1) $parent(i)$ is at $\lfloor i/2 \rfloor$ if $i \ne 1$. If $i = 1$, $i$ is at the root and has no parent.
(2) $LeftChild(i)$ is at $2i$ if $2i \le n$. If $2i > n$, then $i$ has no left child.
(3) $RightChild(i)$ is at $2i + 1$ if $2i + 1 \le n$. If $2i + 1 > n$, then $i$ has no right child.
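The index arithmetic of LEMMA 1 and LEMMA 2 can be written down directly. The following is a minimal sketch, not part of the original algorithm listings; the function names are illustrative assumptions, and `n` is taken to be the largest index in use.

```python
def parent(i):
    """LEMMA 2(1): index of the parent, or None for the root."""
    return i // 2 if i != 1 else None


def left_child(i, n):
    """LEMMA 2(2): index of the left child, or None if 2i > n."""
    return 2 * i if 2 * i <= n else None


def right_child(i, n):
    """LEMMA 2(3): index of the right child, or None if 2i + 1 > n."""
    return 2 * i + 1 if 2 * i + 1 <= n else None


def level(i):
    """Level of node i (root at level 1): floor(log2 i) + 1."""
    return i.bit_length()


def max_nodes_on_level(i):
    """LEMMA 1(1): at most 2^(i-1) nodes on level i."""
    return 2 ** (i - 1)


def max_nodes_of_height(k):
    """LEMMA 1(2): at most 2^k - 1 nodes in a tree of height k."""
    return 2 ** k - 1
```

Using `int.bit_length` avoids floating-point logarithms: for any positive integer `i`, `i.bit_length() - 1` equals $\lfloor \log_2 i \rfloor$.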
TABLE I
The sample table of the KMP string-matching algorithm.

j           1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
pattern[j]  a  b  a  a  b  a  b  a  a  b  a  a  b  a  b  a  a
next[j]     0  1  0  2  1  0  4  0  2  1  0  7  1  0  4  0  2

Fig. 4. The other presentation of TABLE I.

B. A Simple Tree Pattern-Matching Algorithm

B.1 Presentation

The level of a node in the tree is defined as follows. The root is at level 1. If a node is at level $l$, then its children are at level $l + 1$ [2]. Fig. 5 shows the levels of all nodes in that tree. The height of a tree is defined to be the maximum level of any node in the tree [2].

We use a numbering scheme to represent a binary tree in memory. Suppose we number the nodes in a complete binary tree starting with the root on level 1, continuing with the nodes on level 2, and so on. Nodes on any level are numbered from left to right. The numbering representation can clearly be used for all binary trees, though in most cases there will be a lot of unutilized space [2]. Fig. 5 shows the number of each node of the tree: the number in each circle is the index of the node in our numbering representation. Since the nodes are numbered from 1 to $n$, we can use a one-dimensional array to store the nodes.

We use the numbering representation in order to show our idea more clearly, and to determine the relationship between nodes more easily. We will show the complete algorithms in section B.2 for pattern preprocessing and in section B.3 for searching the pattern in the subject tree.

B.2 Pattern Preprocessing

In our algorithm, we need to record information about the pattern. In particular, we need to record the content, nextnode and back of each node. The content field is the content of this node. The content can be a character, a digit, a symbol, etc. The nextnode field is the next node's index in the left-to-right breadth-first traversal of the pattern. The back field records the index of the node that should be compared subsequently when a mismatch occurs. Consequently, the pattern is stored in an array of the structure {content; nextnode; back}. The steps of pattern preprocessing are shown in Fig. 6. For convenience, we divide the algorithm into three parts: Part I is the initialization, Part II is the preprocessing, and Part III is the Descend procedure used in Part II.

When we initialize the data structure for the pattern, we first initialize the content and nextnode fields. We record the content of each node in the content field. The nextnode field of a node, as mentioned before, is the next node's index in the left-to-right breadth-first traversal of the pattern. The nextnode field of the last node in that traversal is recorded as 0, because no other node follows it. The back field of a node is initialized as 0 when the content field of the node is equal to the content field of the pattern's root, and as 1 otherwise, as in Part I of Fig. 6 and in TABLE II. The initialization steps are shown in Part I of Fig. 6. We take the example in Fig. 7 as the pattern and show the result of initialization in TABLE II. In TABLE II, some nodes' fields are empty because those nodes do not exist in the tree. Those entries waste space; we discuss a solution later.
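The initialization described above can be sketched as follows. This is an illustrative sketch rather than the code of Fig. 6: the nested-tuple tree `FIG7` is an assumed encoding of the sample pattern, reconstructed from the node indices listed in TABLE II, and the function names are hypothetical.

```python
# Assumed encoding of the sample pattern: (content, left, right).
FIG7 = ("a",
        ("b", ("c", None, None), ("b", None, None)),
        ("a",
         ("b", ("b", None, None), ("c", None, None)),
         ("a", None, None)))


def number_tree(node, i=1, out=None):
    """Assign numbering-scheme indices: root 1, children 2i and 2i + 1."""
    if out is None:
        out = {}
    if node is not None:
        content, left, right = node
        out[i] = content
        number_tree(left, 2 * i, out)
        number_tree(right, 2 * i + 1, out)
    return out


def init_pattern(contents):
    """Fill content, nextnode and back for every existing node."""
    order = sorted(contents)  # increasing index = left-to-right breadth-first order
    pattern = {}
    for pos, i in enumerate(order):
        pattern[i] = {
            "content": contents[i],
            # 0 marks the last node of the traversal
            "nextnode": order[pos + 1] if pos + 1 < len(order) else 0,
            # 0 if the node repeats the root's content, 1 otherwise
            "back": 0 if contents[i] == contents[1] else 1,
        }
    return pattern


P = init_pattern(number_tree(FIG7))
```

A dictionary keyed by index stands in for the sparse array here, so the empty entries 8 to 11 of TABLE II are simply absent.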

TABLE II
The initialization result of the pattern structure in Fig. 7 (- marks an empty entry).

index      1  2  3  4  5  6  7  8  9 10 11 12 13
content    a  b  a  c  b  b  a  -  -  -  -  b  c
nextnode   2  3  4  5  6  7 12  -  -  -  - 13  0
back       0  1  0  1  1  1  0  -  -  -  -  1  1

Fig. 7. A sample tree.

Definition 1. Let $t$ be a tree, and $v$, $u$ be two nodes of $t$ where $v$ is an ancestor of $u$. We show the tree pictorially in Fig. 8.
(1) The subtree of $t$ with root $v$ is called the v-subtree of $t$.
(2) The subtree of the v-subtree in which $u$ is the last node in the left-to-right breadth-first traversal is called the v-u part of $t$. The v-u part of $t$ is essentially the v-subtree of $t$ with all nodes whose indices are greater than $u$'s index removed.

Fig. 8. The v-subtree and the v-u part.

Part I
    index := 1; base := 1;
    initialize the content and nextnode fields of pattern;
    for ( i := 1; i ≠ 0; i := P[i].nextnode )
    {   // initialize back field of pattern nodes //
        if P[i].content = P[1].content then P[i].back := 0 else P[i].back := 1;
    }

Part II
    for ( i := 2; i ≠ 0; i := P[i].nextnode )
    {   if P[i].content = P[1].content then
        {   base := i;
            Descend(i,1);   // find the most repeated region //
        }
    }

Part III
    Descend(i,index)
    // compare for descendants, if the corresponding nodes' contents are the same, //
    // then continue comparing to find the most repeated region. //
    {   index := P[index].nextnode;
        temp := ⌊log2 index⌋;
        k := 2^temp (base − 1) + index;   // compute the corresponding node //
        if P[k] ≠ null then
        {   if P[k].content ≠ P[index].content then
            {   if P[k].back < index then P[k].back := index;   // find the most repeated region //
            }
            else Descend(k,index);   // continue comparing next node //
        }
    }

Fig. 6. Algorithm of pattern preprocessing.
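The three parts of Fig. 6 can be sketched in Python and checked against the sample pattern of Fig. 7. This is a sketch under stated assumptions, not the authors' code: the tail-recursive Descend is written as a loop, the scan of Part II starts at the root's nextnode rather than at the literal index 2, and the explicit stop when nextnode is 0 is an added guard that Fig. 6 does not spell out.

```python
def preprocess(contents):
    """contents maps numbering-scheme index -> node content (root at 1)."""
    order = sorted(contents)  # left-to-right breadth-first order
    P = {}
    # Part I: content, nextnode and initial back values.
    for pos, i in enumerate(order):
        P[i] = {
            "content": contents[i],
            "nextnode": order[pos + 1] if pos + 1 < len(order) else 0,
            "back": 0 if contents[i] == contents[1] else 1,
        }

    # Part III: walk the pattern and the base-subtree in parallel.
    def descend(base, index):
        while True:
            index = P[index]["nextnode"]
            if index == 0:          # assumed guard: pattern exhausted
                return
            # corresponding node: k = 2^floor(log2 index) * (base - 1) + index
            k = (1 << (index.bit_length() - 1)) * (base - 1) + index
            if k not in P:          # P[k] = null
                return
            if P[k]["content"] != P[index]["content"]:
                if P[k]["back"] < index:
                    P[k]["back"] = index   # longest repeated region so far
                return
            # contents agree: continue with the next pattern node

    # Part II: every later node that repeats the root's content is a base.
    i = P[1]["nextnode"]
    while i != 0:
        if P[i]["content"] == P[1]["content"]:
            descend(i, 1)
        i = P[i]["nextnode"]
    return P


FIG7_CONTENTS = {1: "a", 2: "b", 3: "a", 4: "c", 5: "b",
                 6: "b", 7: "a", 12: "b", 13: "c"}
P = preprocess(FIG7_CONTENTS)
```

On this input the only back value that changes after initialization is that of node 12, which becomes 4.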

Definition 2. Let $t$ and $s$ be two trees, $u$ be a node of $t$, $t[i]$ be the node of $t$ whose index is $i$, and $s[k]$ be the node of $s$ whose index is $k$.
(1) When we want to search for $s$ in $t$, we first need to find a node $u$ of $t$ whose content is the same as that of the root of $s$, so that we can possibly find $s$ in the u-subtree of $t$. We call the node $u$ the headnode.
(2) When the node $u$ of $t$ is the headnode, the node $u$ of $t$ and the root of $s$, which is $s[1]$, are corresponding nodes. The corresponding node of $s[k]$ is $t[i]$ if and only if $t[i]$ is at the same relative position in the u-subtree as $s[k]$ is in $s$. For example, if $t[i]$ and $s[k]$ are corresponding nodes, then $t[i]$'s left (or right) child and $s[k]$'s left (or right, respectively) child are corresponding nodes.
In Fig. 9, the root of tree $s$ corresponds to node $c$ of tree $t$. Therefore, node $c$ of tree $t$ is the corresponding headnode of tree $s$. Node 7 of tree $s$ is at the same relative position as node $n$ of the c-subtree, so the corresponding node of node 7 is node $n$. The dotted lines in Fig. 9 denote the pairs of corresponding nodes.

Fig. 9. Corresponding nodes.

After initialization, we need to find the most appropriate back value for each node. We traverse the pattern in the left-to-right, breadth-first order. Only the descendants of the nodes whose contents equal the content of the pattern's root need to be compared. The steps of computing back are shown in the Descend procedure of Fig. 6. In the Descend procedure, we need to compute the indices of the corresponding nodes. We prove the correctness of the index computation in LEMMA 3. Besides, we must prove that the value of the back field we compute is correct and is the largest; the proofs are shown in LEMMA 4 and LEMMA 5.

LEMMA 3. Let $P$ be a tree, $P[base]$ be the node of $P$ whose index is $base$, $P[index]$ be the node of $P$ whose index is $index$, and $P[k]$ be the node of $P$ whose index is $k$. When the headnode is $P[base]$, $P[index]$'s corresponding node is $P[k]$, where
$k = 2^{\lfloor \log_2 index \rfloor}(base - 1) + index$.

Proof: We want to find corresponding nodes between two subtrees: the $P[1]$-subtree and the $P[base]$-subtree. $P[1]$'s corresponding node is $P[base]$, with $P[base].content = P[1].content$. $P[index]$ is a node of the $P[1]$-subtree, and its corresponding node in the $P[base]$-subtree is $P[k]$. The level of $P[index]$ in the $P[1]$-subtree is $\lfloor \log_2 index \rfloor + 1$. $P[index]$'s corresponding node $P[k]$ is on the same level of the $P[base]$-subtree, and $P[index]$, $P[k]$ are at the same relative position. One characteristic of the numbering scheme is that the leftmost node on level $i$ is numbered $2^{i-1}$ [1] [2]. Hence the leftmost node on that level of the $P[base]$-subtree is numbered $2^{\lfloor \log_2 index \rfloor} \cdot base$, and $P[index]$ is offset from the leftmost node of its level by $index - 2^{\lfloor \log_2 index \rfloor}$. Therefore
$k = 2^{\lfloor \log_2 index \rfloor} \cdot base + (index - 2^{\lfloor \log_2 index \rfloor}) = 2^{\lfloor \log_2 index \rfloor}(base - 1) + index$. □

LEMMA 4. Let $P$ be a tree pattern. When $P[k].back = i$, we can find a subtree of the $P[base]$-$P[k]$ part of pattern $P$ that is isomorphic to the $P[1]$-$P[i]$ part minus $P[i]$, where $P[base]$ is the corresponding headnode of the $P[1]$-subtree and
$base = \dfrac{k - i}{2^{\lfloor \log_2 i \rfloor}} + 1$.

Proof: We number the nodes in the $P[1]$-$P[i]$ part from $node_1$ to $node_{j+1}$, i.e. there are $(j + 1)$ nodes in the $P[1]$-$P[i]$ part, and their corresponding nodes in the $P[base]$-$P[k]$ part from $k_{(1)}$ to $k_{(j+1)}$; $P[k_{(1)}]$ represents $P[base]$ and $P[k_{(j+1)}]$ represents $P[k]$. LEMMA 3 assures that we find the correct corresponding nodes, with
$k_{(x)} = 2^{\lfloor \log_2 node_x \rfloor}(base - 1) + node_x$ for all $x \le j + 1$.
According to our algorithm, when $P[k].back = i$:
$P[node_1].content = P[k_{(1)}].content$,
$P[node_2].content = P[k_{(2)}].content$,
...,
$P[node_j].content = P[k_{(j)}].content$,
but $P[node_{j+1}].content \ne P[k_{(j+1)}].content$.
$\{P[k_{(1)}], P[k_{(2)}], \ldots, P[k_{(j)}]\}$ has the same tree structure as $\{P[node_1], P[node_2], \ldots, P[node_j]\}$, and the contents of each corresponding node pair are the same, so the two are isomorphic. $\{P[k_{(1)}], P[k_{(2)}], \ldots, P[k_{(j)}]\}$ is a subtree of the $P[base]$-$P[k]$ part of the pattern $P$. Hence we can find a subtree of the $P[base]$-$P[k]$ part isomorphic to the $P[1]$-$P[i]$ part minus $P[i]$. □

LEMMA 5. The value of the back field is the largest, i.e., we cannot find any other back value that is greater and satisfies LEMMA 4.

Proof: Suppose $P[k].back = i$ with headnode $P[base]$. Assume that we can find $i'$ such that $i' > i$ and $i'$ satisfies LEMMA 4, i.e., we can find a new corresponding root $P[base']$. Let $L_{i'}$ represent the level of node $P[i']$ and $L_i$ the level of node $P[i]$. Since $i' > i$, we have $L_{i'} = \lfloor \log_2 i' \rfloor + 1 \ge \lfloor \log_2 i \rfloor + 1 = L_i$. By LEMMA 4,
$base = \dfrac{k - i}{2^{L_i - 1}} + 1$ and $base' = \dfrac{k - i'}{2^{L_{i'} - 1}} + 1$,
so $base' < base$. According to our algorithm, $P[base']$ is compared earlier than $P[base]$, and $P[k].back = i'$ after that comparison. When $P[base]$ is the headnode, $i < i' = P[k].back$, so $P[k].back$ is left unchanged, which contradicts $P[k].back = i$. We cannot find another $i'$ that is greater than $i$ and satisfies LEMMA 4. □

We show the result of pattern preprocessing in TABLE III for the example in Fig. 7. Note that the back field of the node whose index is 12 is 4, because {P[1], P[2], P[3]} is isomorphic to {P[3], P[6], P[7]}, and the nextnode field of P[3] is 4. The corresponding node of P[4] is P[12] with headnode P[3], but their contents are not equal and P[12].back = 1 < index = 4. Therefore, P[12].back := 4. When a mismatch occurs at P[12], we can continue by comparing P[P[12].back] (i.e. P[4]). TABLE III may be shown pictorially as in Fig. 10. The thin edges represent the back pointers. Some nodes in Fig. 10 do not have back pointers, because their back value is 0 and their back pointer is NULL.

TABLE III
The result of preprocessing the pattern in Fig. 7 (- marks an empty entry).

index      1  2  3  4  5  6  7  8  9 10 11 12 13
content    a  b  a  c  b  b  a  -  -  -  -  b  c
nextnode   2  3  4  5  6  7 12  -  -  -  - 13  0
back       0  1  0  1  1  1  0  -  -  -  -  4  1

Fig. 10. The pattern after preprocessing.

The number of times that each node in the pattern needs to be visited is bounded by its level. The time complexity of the preprocessing step is bounded by $O(m \log_2 m)$, where $m$ is the number of nodes in the pattern. The worst case may occur when there are many nodes in the pattern that have the same content as the root. For example, if the ancestors of node $P[k]$ in the pattern have the same content as the root, then node $P[k]$ may need to be visited at most $\lfloor \log_2 k \rfloor$ times.

The space complexity depends on the structure of the pattern. When the pattern is skewed and has $m$ nodes, the space complexity is $O(2^m)$; when the pattern is a complete binary tree and has $m$ nodes, the space complexity is $O(m)$. Therefore, the space complexity ranges from $O(m)$ to $O(2^m)$. However, we can use indirection to save space. We use an extra array of structure {index; number} for the tree. The index is the position of the node in the left-to-right breadth-first traversal, and the number is the index of the node in the numbering scheme we adopt. TABLE IV shows this for the example in Fig. 7. The space needed for a binary tree is therefore reduced to $O(m)$ no matter what structure the tree has. Adopting indirection in our algorithm requires an extra searching step. No matter what structure the pattern has, the number values are stored in increasing order, since the numbering follows the same traversal. We can therefore use binary search, which takes $O(\log m)$ time in the worst case. The time complexity of our algorithm in Fig. 6 with indirection is then $O(m \log^2 m)$, where $m$ is the number of nodes in the pattern.

TABLE IV
The indirection structure of the tree in Fig. 7.

index      1  2  3  4  5  6  7  8  9
number     1  2  3  4  5  6  7 12 13

B.3 Searching a Pattern in a Subject Tree

For the subject tree, we need to record the content, nextnode, flag and match of each node. The nextnode field is the next node's index in the left-to-right breadth-first traversal of the subject tree. The flag field is either 0 or 1, and we initialize it as 1. When the node is visited as a headnode, we set its flag field to 0 to avoid repeated comparison. The match field is either 0 or 1, and we initialize it as 0. When we find a full match in the subject tree, we set the match field of the headnode of the full match to 1.
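The subject-tree record and the corresponding-node rule of LEMMA 3, applied across two trees, can be exercised with a deliberately naive baseline. This sketch is not the searching algorithm of Fig. 11: it ignores back and flag, tries every node as a headnode, and assumes that a full match means every pattern node has a corresponding subject node with equal content. All function names are illustrative.

```python
def corresponding(head, index):
    """Subject index at the same relative position as pattern node `index`
    when the pattern root is aligned with subject node `head`."""
    return (1 << (index.bit_length() - 1)) * (head - 1) + index


def init_subject(contents):
    """Build {content; nextnode; flag; match} records, flag = 1, match = 0."""
    order = sorted(contents)
    return {
        i: {"content": contents[i],
            "nextnode": order[pos + 1] if pos + 1 < len(order) else 0,
            "flag": 1,
            "match": 0}
        for pos, i in enumerate(order)
    }


def naive_search(subject_contents, pattern_contents):
    """Baseline: test every headnode against every pattern node."""
    S = init_subject(subject_contents)
    for head in S:
        full = all(
            corresponding(head, j) in S
            and S[corresponding(head, j)]["content"] == c
            for j, c in pattern_contents.items()
        )
        if full:
            S[head]["match"] = 1
    return S


subject = {1: "a", 2: "b", 3: "a", 4: "c", 5: "b",
           6: "b", 7: "a", 12: "b", 13: "c"}
pattern = {1: "a", 2: "b", 3: "a"}
S = naive_search(subject, pattern)
heads = [i for i in sorted(S) if S[i]["match"] == 1]
```

With the tree of Fig. 7 as the subject and the three-node pattern a(b, a), the baseline reports headnodes 1 and 3, which agrees with the observation that {P[1], P[2], P[3]} is isomorphic to {P[3], P[6], P[7]}.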

The data structure of the subject tree is therefore an array of the structure {content; nextnode; flag; match}. The algorithm for searching the pattern in the subject tree is shown in Fig. 11.

Part IV
    head := 0;    // the corresponding headnode //
    index := 1;
    initialize the content and nextnode fields of nodes of the subject tree;
    for ( i := 1; i ≠ 0; i := S[i].nextnode )
    {   // initialize the flag and match fields of subject nodes //
        S[i].match := 0;
        S[i].flag := 1;   // flag=0 means to stop searching //
    }

Part V
    for ( i := 1; i ≠ 0; i := S[i].nextnode )
    {   if S[i].flag ≠ 0 then   // skip nodes that have been searched //
        {   Com_Des(i,1,index,head);   // compare descendants //
            if ( index ≠ 0 and P[index].nextnode = 0 )   // Match //
            then S[head].match := 1;
            else index := 1;   // back to the root of the pattern //
        }
    }

Part VI
    Com_Des(i,j,index,head)
    {   if S[i].content = P[j].content then
        {   head := Base(i,j);   // compute the corresponding head node index //
            index := j;
            do
            {   index := P[index].nextnode;   // next node of the pattern //
                if index ≠ 0 then   // check if the last node //
                {   k := 2^⌊log2 index⌋ (head − 1) + index;   // find corresponding node //
                    if S[k] = null then index := 0
                    else Compare(S[k],P[index],index,head);
                }
            } until (index = 0 or P[index].nextnode = 0);
        }
    }

Part VII
    Compare(m,n,index,head)   // compare nodes //
    {   if m.content ≠ n.content then
        do
        {   index := n.back;   // check back //
            if index ≠ 0 then head := Base(k,index);
            n := P[index];
        } until (m.content = n.content or index = 0);
    }

Part VIII
    int Base(x,y)   // find the corresponding head node //
    {   while y ≠ 1 do
        {   x := ⌊x/2⌋;   // find parent(x) //
            y := ⌊y/2⌋;   // find parent(y) //
        }
        k := x;
        S[k].flag := 0;   // set the headnode's flag as 0 //
        return(k);
    }

Fig. 11. Algorithm for searching the pattern in the subject tree.

TABLE V
The initialization result of the subject tree structure in Fig. 7 (- marks an empty entry).

index      1  2  3  4  5  6  7  8  9 10 11 12 13
content    a  b  a  c  b  b  a  -  -  -  -  b  c
nextnode   2  3  4  5  6  7 12  -  -  -  - 13  0
flag       1  1  1  1  1  1  1  -  -  -  -  1  1
match      0  0  0  0  0  0  0  -  -  -  -  0  0

Part IV of Fig. 11 shows the initialization. Part V is the main code for searching patterns in the subject tree. Part VI, Part VII and Part VIII are related procedures. In Part IV of Fig. 11, we first initialize the content and nextnode fields of the subject tree. The initialization is similar to Part I of Fig. 6. The flag field is initialized as 1. The match field is initialized as 0. We show the result of the initialization of the subject tree of Fig. 7 in TABLE V. As in TABLE II, some nodes' fields are empty in TABLE V. Those nodes do not exist in the tree and waste space.

In Part V of Fig. 11, we traverse the nodes in the left-to-right breadth-first order. Only nodes whose content fields equal the content field of the pattern's root may lead to a match, so only their descendants need to be compared. When a mismatch occurs, we know which node of the pattern should be compared with the node in the subject tree, or whether to skip it. The index of the pattern node that should be compared next comes from the back field of the pattern. When the value of the back field is 0, we skip the node in the subject tree because there is no chance of finding the pattern with this node. Procedure Com_Des of Part VI in Fig. 11 computes the index of a node's corresponding node and checks whether their contents are equal. Procedure Compare, used in procedure Com_Des, handles the situation when a mismatch occurs at the node: we immediately know which node of the pattern should be compared next. Procedure Base, used in procedures Com_Des and Compare, computes the headnode's index in the subject tree.

When we find a corresponding headnode in the subject tree, we continue to compare the descendants to determine whether there is any chance of finding the pattern in the subject tree. The steps are shown in procedure Com_Des. When a mismatch occurs, we handle it with procedure Compare and compute the index of the new corresponding headnode with procedure Base. As in the pattern preprocessing part, we must compute the index of the corresponding node. The computation is the same as in LEMMA 3, applied to two different trees. In Part V of Fig. 11, we visit each node in the subject tree as the headnode in the left-to-right breadth-first order. Therefore, we do not miss any chance of a match.

The number of times each node in the subject tree needs to be visited depends on how many of its ancestors have the same content as the pattern's root. Therefore, the number of times that each node in the subject tree needs to be visited is bounded by its level. The time complexity of the simple tree pattern-matching is bounded by $O(n \log_2 n)$, where $n$ is the number of nodes in the subject tree. The worst case may occur when there are many nodes in the subject tree that have the same content as the root of the pattern.

The space complexity depends on the structure of the subject tree. When the subject tree is skewed and has $n$ nodes, the space complexity is $O(2^n)$; when the subject tree is a complete binary tree and has $n$ nodes, the space complexity is $O(n)$. Therefore, the space complexity ranges from $O(n)$ to $O(2^n)$. However, we can similarly use indirection to save space. We use an extra array of structure {index; number} for the tree. The index is the position of the node in the left-to-right breadth-first traversal, and the number is the index of the node in the numbering scheme we adopt. The space needed for a binary tree is therefore reduced to $O(n)$. Adopting indirection in our algorithm requires an extra searching step. We can use binary search, which takes $O(\log n)$ time in the worst case. The time complexity of our algorithm in Fig. 11 with indirection is then $O(n \log^2 n)$, where $n$ is the number of nodes in the subject tree.

IV. Conclusions and Future Work

A. Conclusions

In our simple tree pattern-matching algorithm, the number of times each node in the subject tree needs to be traversed is bounded by its level. Therefore, the time complexity of the simple tree pattern-matching is bounded by $O(n \log n)$, where $n$ is the number of nodes in the subject tree. The worst case may occur when there are many nodes in the subject tree that have the same content as the root of the pattern. We also need an extra step to preprocess the pattern. The number of times each node needs to be visited during preprocessing depends on how many of its ancestors have the same content as the pattern's root. Therefore, the number of times that each node in the pattern needs to be visited is bounded by its level. The time complexity of the preprocessing step is bounded by $O(m \log m)$, where $m$ is the number of nodes in the pattern. The worst case may occur when there are many nodes in the pattern that have the same content as the root. The space complexity depends on the structures of the pattern and the subject tree. The worst case occurs when the subject tree and the pattern are both skewed; the space complexity is then $O(2^n + 2^m)$, where $n$ is the number of nodes in the subject tree and $m$ is the number of nodes in the pattern. The best case is $O(n + m)$, when both the pattern and the subject tree are complete binary trees. However, we can use indirection to save space.
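The indirection can be sketched as a compact array of numbering-scheme indices, kept in breadth-first order, plus a binary search that maps a numbering-scheme index back to its position. This is an illustrative sketch using Python's standard `bisect` module; the function names and the tuple returned by `lookup` are assumptions, not part of the original paper.

```python
from bisect import bisect_left


def build_indirection(contents):
    """Return (numbers, records): numbers[p] is the numbering-scheme index
    of the (p + 1)-th node in breadth-first order; records[p] is its content."""
    numbers = sorted(contents)  # breadth-first order is increasing index order
    records = [contents[x] for x in numbers]
    return numbers, records


def lookup(numbers, records, number):
    """Binary search for a numbering-scheme index.
    Returns (1-based position, content), or None if the node does not exist."""
    pos = bisect_left(numbers, number)
    if pos < len(numbers) and numbers[pos] == number:
        return pos + 1, records[pos]
    return None


fig7 = {1: "a", 2: "b", 3: "a", 4: "c", 5: "b",
        6: "b", 7: "a", 12: "b", 13: "c"}
numbers, records = build_indirection(fig7)
```

For the tree of Fig. 7 the `numbers` array reproduces the number row of TABLE IV, and each lookup costs a logarithmic number of comparisons, which is the source of the extra logarithmic factor in the time bound.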
Therefore the space complexity is reduced to $O(n + m)$. We can store the pattern in an array of structure {number; content; back} in the left-to-right breadth-first order, and the subject tree in an array of structure {number; content; flag; match} in the left-to-right breadth-first order. The number field is the index of the node in our numbering representation. We do not need to memorize the nextnode of each node, because the nextnode of $P[i]$ is $P[i + 1]$ and the nextnode of $S[k]$ is $S[k + 1]$, for all $0 < i < m$ and $0 < k < n$. But we do need an extra searching algorithm when we visit the nodes.

The simple tree pattern-matching algorithm shown in Fig. 6 and Fig. 11 can be applied only to binary trees. If we want to apply this algorithm to general trees, we must find the largest degree of the trees and extend our numbering scheme accordingly. With this method, we only need to change the computation of the indices of the parent and children nodes in our algorithm.

TABLE VI summarizes the time complexity of the pattern preprocessing and matching techniques in [3] and of our approach. The complexities in TABLE VI are expressed in terms of [3]:
patno: the number of different patterns involved.
patsize: the size of the pattern forest (when patno = 1, patsize is the number of pattern nodes).
subsize: the size of the subject tree.
ht: the height of a specific tree which is constructed as part of preprocessing.
sym: the number of symbols in the alphabet Σ.
rank: the highest rank of any symbol in Σ.
suf: the maximum suffix number of the path string of the tree pattern.
match: the number of matches which are found.

TABLE VI
The time complexity for the preprocessing and matching techniques.

Method | Restrictions | Preprocessing time | Matching time
Naive algorithm | None | None | O(subsize × patsize)
Bottom up with Algorithm A and B [3] | Simple pattern forest | O(patsize^2 × rank + ht × sym × patsize^rank) | O(subsize + match)
Bottom up with Algorithm C [3] | Simple binary forest | O(patsize × ht^2) | O(subsize × ht^2 + match)
Top down with Algorithm D [3] | Patterns are full trees | O(patsize) | O(subsize × patno)
Top down with Algorithm D [3] | None | O(patsize) | O(subsize × suf)
Our approach | For binary trees and patno = 1 | O(patsize × ht) | O(subsize log subsize)
Our approach for general trees | patno = 1 | O(patsize × ht) | O(subsize log subsize)
Our approach with indirection | For binary trees and patno = 1 | O(patsize × ht^2) | O(subsize log^2 subsize)

When we restrict the pattern and the subject tree to binary trees and patno = 1, our algorithm has a shorter preprocessing time than the bottom-up algorithms listed in TABLE VI. We cannot tell in general which one has the shorter matching time, except that when ht = patsize our algorithm may have a shorter matching time than the bottom-up method with Algorithm C. The top-down method with Algorithm D performs better than our algorithm in both preprocessing time and matching time. But when suf is large enough (i.e. suf is equal to the height of the subject tree), our algorithm can have the same performance as the top-down method with Algorithm D.

B. Future Work

The pattern preprocessing method we use in our approach is not efficient and complete. In the pattern preprocessing of the Knuth-Morris-Pratt string-matching algorithm, next[j] comes from f[j], and f[j] is the largest i less than j such that pattern[1] ... pattern[i − 1] = pattern[j − i + 1] ... pattern[j − 1]. Each character in the string needs to be visited once. Likewise, if each node in the subject tree needed to be visited at most twice, the matching time of our algorithm could be reduced to $O(n)$, where $n$ is the number of nodes in the subject tree (i.e. the size of the subject tree). But a tree is more complicated than a string, so we should keep trying to find an efficient and correct method for the pattern preprocessing, as well as a proper data structure for the trees. Moreover, our simple tree pattern-matching algorithm restricts the number of patterns to 1, which means our approach cannot be applied to a pattern forest. When the number of patterns is larger than 1, the time complexities of pattern preprocessing and matching are multiplied accordingly, even when the pattern forest has no independent subtrees and there are close relationships between the patterns. In [3], Hoffmann and O'Donnell proposed the subsumption graph and other concepts to handle the simple pattern forest. Our future work should include this part of pattern forest handling.

References

[1] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1996.
[2] E. Horowitz and S. Sahni. Fundamentals of Data Structures in Pascal. Computer Science Press, 1976.
[3] C. M. Hoffmann and M. J. O'Donnell. Pattern Matching in Trees. JACM, Vol. 29, No. 1, pp. 68-95, January 1982.
[4] H. U. Simon. Pattern Matching in Trees and Nets. Acta Informatica, pp. 227-248, 1983.
[5] D. E. Knuth, J. H. Morris, and V. R. Pratt. Fast Pattern Matching in Strings. SIAM J. Comput., Vol. 6, No. 2, pp. 323-350, June 1977.
[6] E. M. Reingold, K. J. Urban, and D. Gries. K-M-P string matching revisited. Information Processing Letters, Vol. 64, pp. 217-223, 1997.
[7] D. S. Hirschberg. A Linear Space Algorithm for Computing Maximal Common Subsequences. Communications of the ACM, Vol. 18, No. 6, pp. 341-343, June 1975.
[8] J. W. Hunt and T. G. Szymanski. A Fast Algorithm for Computing Longest Common Subsequences. Communications of the ACM, Vol. 20, No. 5, pp. 350-353, May 1977.
[9] D. S. Hirschberg. Algorithms for the Longest Common Subsequence Problem. JACM, Vol. 24, No. 4, pp. 664-675, October 1977.
[10] K. Z. Zhang. The Editing Distance Between Trees: Algorithms and Applications. Technical Report, New York University, July 1989.
[11] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
