
Video data indexing by 2D C-trees


* Department of Computer Science and Information Engineering, National Chiao Tung University, 1001 Ta Hsueh Road, HsinChu, Taiwan 30050, Republic of China. Email: fjhsu@info4.csie.nctu.edu.tw.
† Computer & Communication Research Laboratories, Industrial Technology Research Institute, HsinChu, Taiwan.
* Corresponding author.

Accepted 5 May 1998

Video data contains a large amount of spatial and temporal information. The changes between video frames are quite useful for motion analysis and cannot easily be provided by other media. Chang et al. proposed an effective 2D string approach for spatial indexing of image data. In this paper, we extend this iconic approach and apply it to video data indexing. We propose 2D C-trees to represent the spatial content within individual frames. A video sequence can then be represented and indexed by a temporal, ordered set of 2D C-trees. The similarity retrieval of video data becomes the problem of video sequence matching, i.e. computing the similarity, or the minimum cost, of matched frames. We present three schemes, full-sequence matching, segment matching and subsequence matching, for video information retrieval. The matching schemes can be modified and extended to approximate sequence matching by computing partial distances, providing a comprehensive retrieval of video data. A prototype video information system is also developed to validate the effectiveness of video data indexing by 2D C-trees.

© 1998 Academic Press

Notation

Ab          the begin-bound of object A
Ae          the end-bound of object A
R           the root of a 2D C-tree
si          the ith immediate descendant of node S
ε           empty-node
J           a rooted tree
e1, e2, …, ek   the editing operations
D           the cost function of an editing operation
d(J1, J2)   the tree distance between tree J1 and tree J2
g(J)        the number of symbols in tree J
|w|         the size of a frame w
d(r, q)     the frame distance between reference frame r and query frame q
V           the reference sequence
vi          the ith frame of reference sequence V
U           the query sequence
uj          the jth frame of query sequence U
D[i, j]     the minimum cost of transforming v1 v2 … vi into u1 u2 … uj
W(V, U)     the full-sequence matching distance between sequences V and U
θ           a matching threshold
d[i, j]     the partial cost of transforming v1 v2 … vi into u1 u2 … uj
t(V, U)     the subsequence matching distance between sequences V and U
c(r, q)     the partial distance between reference frame r and query frame q

1. Introduction

WHEN MANAGING a large amount of data of any kind, indexing is useful for rapid retrieval. As we move towards on-line digital libraries and global access to digital data, indexing is becoming more important in database management systems [1]. Video is a medium of high complexity that contains a large amount of spatial and temporal information. It can provide more information than text, graphics and even images. Information about position, distance, and temporal and spatial relationships is included in video data implicitly. In current video database systems, fundamental techniques such as keyword-based searching [2] and hierarchical video icon browsing [3] are provided for user query. Most previous research on video data has focused on motion and scene analysis [4]. Very little work has been done on the design of index structures that characterize spatial and temporal attributes for video databases.

Video frame sequences, as compared with a single video frame, contain information about the dynamic aspects of a recorded scene. In particular, the changes of video objects in equally divided temporal intervals are quite useful for motion analysis and cannot be provided easily by other media. For example, we can ask a query like: retrieve all the video sequences in which there is a running athlete holding a basketball, crossing over the field, jumping up, hanging over the basket and then slamming. Much spatial and temporal information of the video sequence needs to be extracted to serve this query. These spatial and temporal relations play the indexing role in content-based video information retrieval [5–7]. Unfortunately, the spatio-temporal information of video data cannot be extracted automatically without human interaction with the current techniques in image/video understanding and recognition. Even if the spatial and temporal relationships among objects or semantic units can be inferred from an object's location and duration, state-of-the-art software for manipulating video does not 'know' about such objects [8], for example, what a basketball is. Nevertheless, an iconic approach with human involvement can compensate for this difficulty and deficiency.

The iconic approach by 2D strings for spatial indexing was initially proposed by Chang et al. [9] to represent symbolic pictures. First, the objects in the original pictures are identified after preprocessing, by either image processing and pattern recognition techniques or human involvement. Then the orthogonal relations of objects with respect to other objects are generated and iconized. After all the iconic objects have been processed, the symbolic picture that preserves the spatial relationships among objects in the original picture is encoded as a 2D string. The problem of pictorial information retrieval thus becomes a problem of 2D string subsequence matching. In this paper, we focus on the following issues:

(1) Developing a video indexing structure that characterizes the spatio-temporal information of video content.

(2) Applying knowledge representation techniques to the development of index construction and retrieval tools.

(3) Developing a video information retrieval environment for interacting with video objects.

In this paper, the spatial indexing of pictures by 2D strings is briefly introduced in Section 2. Then we propose a new knowledge representation, the 2D C-tree [11], as the spatial indexing of pictures in Section 3. The 2D C-tree is employed for single-frame matching in similarity retrieval. Section 4 describes how 2D C-trees are extended to video sequence matching in video information retrieval. We propose three matching schemes: full-sequence matching, segment matching and subsequence matching. The algorithms for the matching schemes are developed in Section 5. The video sequence matching algorithms are modified and extended to approximate sequence matching, providing a comprehensive video information retrieval methodology discussed in Section 6. In Section 7, we present a prototype system from our video information retrieval project. Finally, conclusions and future work are summarized.

2. Spatial Indexing

The 2D string approach for spatial indexing was initially proposed by Chang et al. [9] to represent symbolic images. Three spatial relation operators, '<', '=' and ':', are employed in 2D strings. The operator '<' denotes the 'left-right' or 'below-above' spatial relation. The operator '=' denotes the 'at the same spatial location as' relation. The operator ':' denotes the 'in the same set as' relation. The symbolic picture f1 in Figure 1(a) can be represented as the 2D string (A < D:E < B < C, A < B < C < D:E), or as (A < DE < B < C, A < B < C < DE) where the symbol ':' may be omitted.

The 2D string representation is also suitable for formulating picture queries. In fact, the query can be specified graphically by drawing an iconic image on the screen of a computer. The graphic representation, called an icon sketch, can be translated into the 2D string representation. For example, we may want to retrieve images satisfying a certain icon sketch q1 as in Figure 1(b). Then q1 can be translated into the 2D string (A < E < C, A < C < E). This query string is a subsequence of the 2D string representation of the example image f1. The problem of image retrieval then becomes the problem of 2D string subsequence matching.
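The following Python snippet is a simplified sketch of this idea, not the full type-i/type-ii matching rules of Chang et al. [9]: each axis string is treated as an ordered list of symbol groups separated by '<', and the query matches if its groups appear, in order, as subsets of the image's groups. All function names are ours, for illustration only, and symbols are assumed to be single characters.

def parse_axis_string(s):
    """'A<DE<B<C' -> [{'A'}, {'D', 'E'}, {'B'}, {'C'}]."""
    return [set(group) for group in s.split('<')]

def is_axis_subsequence(query, image):
    """Greedily check that each query group is contained, in order,
    in a distinct group of the image string."""
    groups = parse_axis_string(image)
    pos = 0
    for qg in parse_axis_string(query):
        while pos < len(groups) and not qg <= groups[pos]:
            pos += 1
        if pos == len(groups):
            return False
        pos += 1  # each image group is consumed at most once
    return True

def matches_2d(query_x, query_y, image_x, image_y):
    return (is_axis_subsequence(query_x, image_x) and
            is_axis_subsequence(query_y, image_y))

# q1 = (A<E<C, A<C<E) is a subsequence of f1 = (A<DE<B<C, A<B<C<DE).
assert matches_2d("A<E<C", "A<C<E", "A<DE<B<C", "A<B<C<DE")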

However, the spatial operators of 2D strings are not sufficient to give a complete description of spatial knowledge for images of arbitrary complexity.


Figure 1. A symbolic image and a query sketch

Table 1. The definition of characteristic spatial operators

Notation   Condition                  Meaning
A < B      Ae < Bb                    A disjoins B
A | B      Ae = Bb                    A is edge to edge with B
A = B      Ab = Bb and Ae = Be        A is the same as B
A [ B      Ab = Bb and Ae > Be        A contains B and they have the same begin-bound
A ] B      Ab < Bb and Ae = Be        A contains B and they have the same end-bound
A % B      Ab < Bb and Ae > Be        A contains B and they do not have the same bound
A / B      Ab < Bb ≤ Ae < Be          A is partly overlapping with B

Note: The notations Ab and Ae (Bb and Be) denote the values of the begin- and end-bound of object A (B), respectively.

Many extended representations were proposed to handle more types of relations between pictorial objects, but they are not economical for complex images in terms of storage space efficiency and navigation complexity. Lee and Hsu [12] proposed the 2D C-string representation, with a set of spatial operators and a more efficient cutting mechanism. They employed the characteristic set of spatial operators illustrated in Table 1 to give a complete description for images of arbitrary complexity.

Basically, the 2D C-string approach performs a cut to handle the cases of partly overlapping objects. The global operators '<' and '|', which are employed in the original 2D string approach, handle the cases of non-overlapping. The extended operators '=', '[', '%' and ']', called the local operators, together with a pair of separators '( )', handle the cases of overlapping. The picture f2 in Figure 2(a) is similar to f1 in Figure 1(a) except that the objects in f2 are non-zero-sized objects, as opposed to the point objects in f1. The 2D C-string representation of the picture f2 is (A ] D [ E | B | C, A | B % C | D ] E). Note that all the objects in f2 remain intact, without cutting, because no partial overlapping occurs.

The 2D C-string is efficient in representing and manipulating images, but it is not suitable for solving the problem of 2D string subsequence matching. For example, consider the query sketch q2 in Figure 2(b).


Figure 2. A symbolic image with non-zero sized objects and a query sketch

The 2D C-string representation of the query image is (A % E < C, A < C < E). This query string of q2 is quite different from the example string of f2 because of the spatial operators: the former string is no longer a substring of the latter. Unfortunately, these operators are needed to handle the global and local relations among symbolic objects in a 2D C-string and cannot be omitted.

Although the spatial relations between objects in a given 2D C-string can be inferred in spatial reasoning by using the ranking mechanism [13], the computation of object ranks in a 2D C-string is somewhat complicated. Moreover, all the spatial relationships of object pairs, of which there are O(n²) for n objects in an image, must first be reasoned by adopting the 2D longest common subsequence algorithm. The algorithm for similarity retrieval actually finds a maximum clique [13] and becomes an NP-complete problem, though there are polynomial-time algorithms for the average case. Therefore, we explore a more effective representation for spatial indexing of images.

3. 2D C-tree

We first introduce the basic structure of a 2D C-tree [11], which is an ordered labeled tree. The 2D C-tree representation still employs the sparse cutting mechanism to handle symbolic images with partly overlapping objects. The cutting mechanism performs the required cuts to remove the ambiguity incurred by partial overlapping. After cutting, an image is partitioned into portions between neighboring cuts. All the portions are sequentially linked to a root, R, which is initialized to represent the margin or boundary of the area covered by the given image.

The 2D C-tree representation was developed originally with spatial operators associated with the links of the tree. Each node, with a label or symbol name, represents an object in the image. The links connecting two nodes are signed with the relation operators. For the ordered subtree rooted at node S with n immediate descendants in the ordering s1, s2, …, sn, the parent S contains the local body consisting of all its immediate child-nodes s1, s2, …, sn.


Figure 3. The general 2D C-trees of image f2. (a) f2x on the x-coordinate axis; (b) f2y on the y-coordinate axis

To simplify the tree representation, empty-nodes and set-nodes are employed in order to remove the relation operators from the tree according to the basic definition of the operators. An empty-node is a pseudo-node which is labeled 'ε' and is used to represent empty areas of any size. The relation operator of a link can be removed by inserting suitable empty-nodes. Each empty-node is considered an object of its own. The set-node is introduced for handling a lineage in which each node has a single child-node. A set-node is a multi-label node consisting of objects that have the same begin-bound and end-bound. The detailed transformation rules are investigated by Hsu et al. [11]. The sample symbolic image f2 in Figure 2(a) is represented as general 2D C-trees as shown in Figure 3.

Ordered labeled trees are trees whose nodes are labeled and in which the left-to-right ordering among siblings is significant. The distance and/or similarity of such trees has many applications in computer vision, pattern recognition, programming compilation and natural language processing [14]. The distance between two ordered trees is considered to be the weighted number of editing operations (insert, delete and relabel) needed to transform one tree into another. Each operation is associated with a cost function, denoted D. Let E be a sequence e1, e2, …, ek of editing operations. An E-derivation from tree J1 to tree J2 is a sequence of trees A0, A1, …, Ak such that J1 = A0, J2 = Ak, and Ai−1 → Ai via ei for 1 ≤ i ≤ k. The cost of the E-derivation is measured by the total weight of the editing operations in E, i.e.

    D(E) = Σ_{i=1}^{|E|} D(ei)    (1)

Formally, the editing distance between trees J1 and J2 is defined as

    d(J1, J2) = min{D(E) | E is an editing sequence from J1 to J2}    (2)

In Hsu et al. [15], we proposed a 2D C-tree matching algorithm, modifying the tree matching algorithm developed by Zhang [16], to compute the editing distance between 2D C-trees for solving the image retrieval problem. Consider two pictures P1 and P2, and the 2D C-tree representations J1x (J2x) and J1y (J2y) of P1 (P2) along the x- and y-coordinate axes, respectively.


Figure 4. The 2D C-trees of query sketch q2. (a) q2x on the x-coordinate axis; (b) q2y on the y-coordinate axis

We define the distance between the two pictures, d(P1, P2), to be equal to d(J1x, J2x) × d(J1y, J2y). If one of the distances d(J1x, J2x) and d(J1y, J2y) is zero, d(P1, P2) is defined to be equal to the other. The example picture f2 in Figure 2(a) and the query sketch q2 in Figure 2(b) are used to illustrate the computation of the picture distance. The 2D C-trees of f2 and q2 along the x-coordinate axis are shown in Figures 3(a) and 4(a), respectively. The editing distance along the x-coordinate axis between these two trees is the cost of the editing operations needed to transform f2x into q2x. At least two editing operations are needed: in f2x, the symbol D is deleted and the symbol B is replaced by an empty-node ε. The tree distance d(f2x, q2x) is therefore 2. Along the y-coordinate axis, delete operations are needed for the symbols B, D and ε to transform f2y into q2y, and the cost d(f2y, q2y) is 3. Finally, the distance d(f2, q2) is 6.

The proposed 2D C-tree matching algorithm takes O(n1²n2²) time to compute the editing distance between two trees consisting of n1 and n2 nodes, respectively. Once all the distances between the query and the sample images in the database have been computed, the most similar image can be obtained. Suppose that there are N images in the database, P1, P2, …, PN, and a query image Q. The most similar image(s) to Q is

    {Pi | d(Pi, Q) is the minimum of d(Pk, Q), 1 ≤ k ≤ N}.

4. Video Information Retrieval

The structural matching of 2D C-trees plays an effective role in spatial queries between images. An image can be viewed as a frame of video data, ignoring sound. We can therefore directly extend the approach of iconic image indexing and apply it to video data indexing for video information retrieval.

A video sequence consists of a number of frames. Each frame can be constructed into two 2D C-trees, along the x- and y-directions. The 2D C-trees preserve the spatial relationships among objects in an individual frame. A video sequence is then represented and indexed by an ordering set of 2D C-trees. The essence of video is that it captures the movements of objects, and it is therefore necessary to allow searching for a changing object relationship within a video sequence. The capability to find frames within a video sequence that contain objects in a specified relationship is useful [1]. Recall the example query in Section 1: retrieve all the video sequences in which there is a running athlete holding a basketball, crossing over the field, jumping up, hanging over the basket and then slamming. This query can be specified as a sequence of query frames, as illustrated in Figure 5.


Figure 5. An example query sequence of five frames

For human beings, similarity retrieval is closer to conceptual thinking; the query therefore need not be an 'exact match'. We match each query frame against each frame of every reference sequence in the video databases. The matching process measures the similarity of frames by computing the distance between the query frame and the reference frame. Note that if a reference frame r is matched by a query frame q, we count the distance between them, i.e. d(r, q). Otherwise, we count the sizes of the frames r and q as the distance.

Definition 1 (Size of a frame). The size of a frame w is defined as the product of the sizes of its 2D C-tree representations along the x- and y-axis directions, wx and wy, respectively. The size of a 2D C-tree wx (wy) is the number of symbols in wx (wy), termed g(wx) (g(wy)); the root node is not counted towards the frame size. Thus, the size of a frame w, referred to as |w|, is (g(wx) − 1) × (g(wy) − 1).
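A minimal Python sketch of this definition, together with the frame-distance rule of Section 3, follows. The function names are ours, and an actual tree edit distance routine is assumed to exist elsewhere.

def frame_size(g_x, g_y):
    """Definition 1: |w| = (g(wx) - 1) * (g(wy) - 1), where g counts
    the symbols of each 2D C-tree including the root."""
    return (g_x - 1) * (g_y - 1)

def frame_distance(d_x, d_y):
    """Frame distance as in Section 3: the product of the two
    per-axis tree distances; if one of them is zero, the frame
    distance is defined to be the other."""
    if d_x == 0:
        return d_y
    if d_y == 0:
        return d_x
    return d_x * d_y

# The example of Section 3: d(f2x, q2x) = 2 and d(f2y, q2y) = 3,
# giving d(f2, q2) = 6.
assert frame_distance(2, 3) == 6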

For video sequence matching, we accumulate the total differences or distances between the frames of the query sequence and the frames of the reference sequence as the similarity of these sequences. The smaller the distance is, the more similar the two sequences are.

For the retrieval of similar video sequences, we propose three matching schemes for video information retrieval. To illustrate these schemes, suppose there are m frames in the reference sequence V, in the ordering v1, v2, …, vm, and n frames in the query sequence U, in the ordering u1, u2, …, un.

The full-sequence matching scheme finds the minimum cost of a full-sequence comparison between V and U. Every frame in the reference sequence and in the query sequence is considered (see Figure 6).

The full-sequence comparison is defined by computing the editing distance required to change the reference sequence into the query sequence. Three types of changes between sequences are allowed: (1) delete, which deletes a frame from the reference sequence; (2) insert, which inserts a frame into the query sequence; and (3) replace, which replaces a frame of the reference sequence with a frame of the query sequence. For a change from v1, v2, …, vi to u1, u2, …, uj, we compute the minimum cost of the full-sequence comparison between V and U.

The second matching scheme, called segment matching, is intended for query segments. Usually, a reference sequence contains a large number of frames, and users may not be able to specify an entire query sequence in practice. Therefore, users specify an essential segment for video matching. For example, we may ask: 'please retrieve the video sequences that contain this query segment'. Consider the sequences V and U, and assume m ≥ n.


Figure 6. Full-sequence matching scheme

Figure 7. Segment matching scheme

Figure 8. Subsequence matching scheme

We compare only the consecutive n frames of the reference sequence V with the query segment U. A segment of V, vi vi+1 … vi+(n−1), where 1 ≤ i ≤ m − n + 1, matches the query segment U, u1 u2 … un, correspondingly (see Figure 7).

Sometimes users may specify a query via some key frames. These key frames may intermittently match a subsequence of the reference sequence. The third matching scheme, called subsequence matching, is used for matching such query subsequences. Only the distance of matched frames is considered. The unmatched frames, being non-significant or redundant in the reference sequence, are ignored and their cost is not included. Figure 8 illustrates a subsequence matching between the reference sequence V and the query subsequence U. The frame vi′ of V matches the query frame ui, 1 ≤ i ≤ n. If vi′ and vj′ match ui and uj correspondingly for i < j, then i′ < j′ must hold.

Therefore, we can retrieve similar video sequences from the video databases via a certain query by using the above matching schemes.

5. Matching Algorithms

In this section we develop the matching algorithms for the video sequence matching schemes. To explain the algorithms, the reference sequence V = v1 v2 … vm and the query sequence U = u1 u2 … un are used to illustrate the matching.

5.1. Full-Sequence Matching Scheme

The full-sequence matching between V and U computes the editing distance required to transform V into U. Three types of editing operations between sequences are considered: delete, insert and replace. The cost of deleting a reference frame vi or inserting a query frame uj is weighted by the size of the frame concerned, i.e. |vi| or |uj|, respectively. The cost of the replace operation is the distance between the two frames involved: if the ith frame of reference V, vi, is replaced by the jth frame of query U, uj, the replace cost is d(vi, uj). For a transformation from v1 v2 … vi to u1 u2 … uj, the minimum cost, denoted by D[i, j], is defined as the minimum value of the following three cases:

(1) D[i−1, j] + |vi| (delete frame vi),
(2) D[i, j−1] + |uj| (insert frame uj),
(3) D[i−1, j−1] + d(vi, uj) (replace frame vi by frame uj).

That is,

    D[i, j] = min{D[i−1, j] + |vi|, D[i, j−1] + |uj|, D[i−1, j−1] + d(vi, uj)}.    (3)

The well-known symbolic string sequence comparison algorithm using the dynamic programming technique [17] can be extended to solve the frame sequence matching problem. Suppose that all the frames of the video sequence V (U) are constructed into their respective 2D C-tree representations, so that the video sequence is represented and indexed by an ordering set of 2D C-trees. First, we compute all the frame sizes and the frame distances between the frames of V and U. Then the minimum change from V to U, denoted by W(V, U), is D[m, n], computed by the above recurrence relation between v1 v2 … vm and u1 u2 … un, i.e. W(V, U) = D[m, n].

The full-sequence matching algorithm is developed as follows.

Algorithm 1: Full-Sequence Matching Algorithm
Input: The frame-to-frame distance table d(vi, uj) for replacing vi with uj, and the frame sizes V[i] and U[j].
Output: The minimum editing distance between V and U, W(V, U).
Begin
  D[0, 0] := 0;
  for i := 1 to m do D[i, 0] := D[i−1, 0] + V[i];
  for j := 1 to n do D[0, j] := D[0, j−1] + U[j];
  for i := 1 to m
    for j := 1 to n
      D[i, j] := min(D[i−1, j] + V[i], D[i, j−1] + U[j], D[i−1, j−1] + d(vi, uj));
  return D[m, n];
End;

It is clear from the above algorithm that the running time is O(mn). An example is demonstrated below. Sequence V has six frames and sequence U has five frames. Suppose that the sizes of these 11 frames are all 25. The frame distances among them, computed as described in Section 3, are listed in Table 2.

The computation of D[6, 5] between sequences V and U is shown in Table 3. The minimum cost of the editing distance between sequences V and U, W(V, U), is 36. From Tables 2 and 3 we can trace back the correspondence of the matching frames. The correspondences (v1, u1, 2), (v2, u2, 2), (v3, null, 25), (v4, u3, 6), (v5, u4, 0) and (v6, u5, 1)

constitute an optimal matching [18]. Each triple (vi, uj, cost) denotes a mapping from vi to uj together with the associated cost. In our sequence matching notation, uj = null denotes a delete operation and vi = null denotes an insert operation. In this case, v3 is deleted in the transformation from V to U.

Table 2. The frame distances between sequences V and U

d(vi, uj)   v1   v2   v3   v4   v5   v6
u1           2    4    6    9   12   16
u2           0    2    4    6    6    8
u3          16   12    8    6    4    2
u4           6    4    2    1    0    1
u5          12    9    6    4    2    1

Table 3. The computation of D[6, 5] between sequences V and U

        V:   0    1    2    3    4    5    6
U  0         0   25   50   75  100  125  150
   1        25    2   27   52   77  102  127
   2        50   25    4   29   54   79  104
   3        75   50   29   12   35   58   81
   4       100   75   54   31   13   35   59
   5       125  100   79   56   35   15   36
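The following runnable Python sketch is our transcription of Algorithm 1, assuming the distance table is stored one row per query frame; fed the Table 2 data, it reproduces W(V, U) = 36 from Table 3.

def full_sequence_distance(dist, v_sizes, u_sizes):
    """Equation (3) by dynamic programming. dist[j][i] holds
    d(v_{i+1}, u_{j+1})."""
    m, n = len(v_sizes), len(u_sizes)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, m + 1):                  # delete all of V
        D[0][i] = D[0][i - 1] + v_sizes[i - 1]
    for j in range(1, n + 1):                  # insert all of U
        D[j][0] = D[j - 1][0] + u_sizes[j - 1]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            D[j][i] = min(D[j][i - 1] + v_sizes[i - 1],          # delete v_i
                          D[j - 1][i] + u_sizes[j - 1],          # insert u_j
                          D[j - 1][i - 1] + dist[j - 1][i - 1])  # replace
    return D[n][m]

# Frame distances d(v_i, u_j) from Table 2 (rows u1..u5, columns v1..v6).
dist = [[2, 4, 6, 9, 12, 16],
        [0, 2, 4, 6, 6, 8],
        [16, 12, 8, 6, 4, 2],
        [6, 4, 2, 1, 0, 1],
        [12, 9, 6, 4, 2, 1]]

# All 11 frame sizes are 25, and W(V, U) = D[6, 5] = 36 as in Table 3.
assert full_sequence_distance(dist, [25] * 6, [25] * 5) == 36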


This minimum-cost computation between two sequences is useful for retrieving similar sequences. As with image queries, we can retrieve similar sequences from the database via a query sequence based on the editing distance. Suppose that there are N sequences in the database, S1, S2, …, SN, and a query sequence T. The most similar sequence(s) to T is the Si such that W(Si, T) is the minimum over all N sequences:

    {Si | W(Si, T) is the minimum of W(Sk, T), 1 ≤ k ≤ N}.

5.2. Segment Matching

Consider sequences V and U. We want to find a segment of V, vi vi+1 … vi+(n−1), where 1 ≤ i ≤ m − n + 1, which matches the query segment U correspondingly. Usually, the user cannot precisely specify the query segment frame by frame. We use a threshold θ to allow a certain difference between the corresponding frames. If the distance between two corresponding frames vi and uj is smaller than θ, these two frames match and constitute a matched pair (vi, uj). If θ = 0, the problem becomes a string matching problem, which can be solved easily by the famous KMP algorithm [19]. Unfortunately, the tolerance θ affects the matching and the computation of the slide position for backtracking in the KMP algorithm: the slide operation must be rechecked and cannot be allowed when the distance is over the threshold. For this reason, we propose a dedicated segment matching algorithm:

Algorithm 2: Segment Matching Algorithm
Input: The frame-to-frame distance table d(vi, uj) for replacing vi with uj, and a threshold θ.
Output: The start positions of the matched segments in the reference video sequence.
Begin
  for i := 1 to m − n + 1
    k := i; j := 1;
    while j ≤ n do
      if d(vk, uj) < θ then k := k + 1; j := j + 1;
      else break;
    if j > n then print matched position i;
End;
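A Python transcription of Algorithm 2 is sketched below. As a self-contained check it reuses the six-frame distance data of Table 2 with θ = 10 (rather than the larger Table 4 example used in the text); under that data the segments starting at v1 and v2 match. The function and variable names are ours.

def segment_matches(dist, m, n, theta):
    """Algorithm 2: 1-based start positions i such that the segment
    v_i .. v_{i+n-1} matches u_1 .. u_n frame by frame, each pair
    having distance strictly below theta. dist[j][i] holds
    d(v_{i+1}, u_{j+1})."""
    positions = []
    for i in range(1, m - n + 2):
        if all(dist[j][i + j - 1] < theta for j in range(n)):
            positions.append(i)
    return positions

dist = [[2, 4, 6, 9, 12, 16],    # d(v_i, u_1)
        [0, 2, 4, 6, 6, 8],      # d(v_i, u_2)
        [16, 12, 8, 6, 4, 2],    # d(v_i, u_3)
        [6, 4, 2, 1, 0, 1],      # d(v_i, u_4)
        [12, 9, 6, 4, 2, 1]]     # d(v_i, u_5)

# With theta = 10, the segments starting at positions 1 and 2 match.
assert segment_matches(dist, m=6, n=5, theta=10) == [1, 2]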

We use example data extended from Table 2 to illustrate the segment matching process. Suppose that there are 12 frames in the reference sequence V, and that the query segment U consists of five frames. Their frame-to-frame distances are shown in Table 4.

Obviously, no segment of five frames in sequence V matches the query segment U exactly; in other words, U is not an exact segment of sequence V. If we allow a threshold of θ = 10, meaning that two frames match when their distance is smaller than 10, then according to the above algorithm we obtain four segments of V matching the query u1u2u3u4u5: v5v6v7v8v9, v6v7v8v9v10, v7v8v9v10v11 and v8v9v10v11v12. The costs of these matched segments are 15, 15, 17 and 20, respectively.

Table 4. The frame distances between reference sequence V and query segment U

d(vi, uj)   v1   v2   v3   v4   v5   v6   v7   v8   v9   v10  v11  v12
u5          20   20   16   16   12    9    6    4    2    1    0    1

It is interesting that the cost of the unmatched segment v4v5v6v7v8 is 19, which is in fact smaller than that of the matched segment v8v9v10v11v12, which is 20. Thus, for different applications, users may change the selection criterion to the total cost of the two segments. We make a small modification of the segment matching algorithm for this purpose.

Algorithm 3: Modified Segment Matching Algorithm
Input: The frame-to-frame distance table d(vi, uj) for replacing vi with uj, and a threshold on the total cost, Θ.
Output: The start positions of the matched segments in the reference video sequence.
Begin
  for i := 1 to m − n + 1
    k := i; j := 1; tc := 0;
    while j ≤ n do
      if (tc + d(vk, uj)) < Θ then tc := tc + d(vk, uj); k := k + 1; j := j + 1;
      else break;
    if j > n then print matched position i;
End;

Note that these two segment matching algorithms also take O(mn) time to match the segments.

5.3. Subsequence Matching

The subsequence matching considers only the subsequence consisting of the key frames specified in the query segment. The subsequence matching algorithm can employ the dynamic programming technique like the full-sequence matching algorithm (Algorithm 1), except that it ignores the cost of unmatched frames in the reference sequence. In other words, the cost of deleting an unmatched frame in the reference sequence is weighted zero. On the contrary, every frame in a query segment needs to be matched. If a frame of the query segment is matched, the cost of matching is the distance between the corresponding frames; otherwise, the cost of an unmatched frame in the query segment is measured by the frame size as defined before. The subsequence matching algorithm is developed as follows.

Algorithm 4: Subsequence Matching Algorithm
Input: The frame-to-frame distance table d(vi, uj) for replacing vi with uj, and the query frame sizes U[1..n].
Output: The minimum subsequence distance between V and U, t(V, U).
Begin
  for i := 0 to m do d[i, 0] := 0;
  for j := 1 to n do d[0, j] := d[0, j−1] + U[j];
  for i := 1 to m
    for j := 1 to n
      d[i, j] := min(d[i−1, j], d[i, j−1] + U[j], d[i−1, j−1] + d(vi, uj));
  return d[m, n];
End;

The subsequence matching algorithm, employing the dynamic programming technique, also takes O(mn) time to compute the minimum subsequence distance.
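The sketch below is our Python transcription of Algorithm 4, following the same table layout as the full-sequence sketch above.

def subsequence_distance(dist, m, n, u_sizes):
    """Algorithm 4: minimum subsequence distance. Skipping a
    reference frame is free; an unmatched query frame u_j costs its
    size u_sizes[j-1]. dist[j][i] holds d(v_{i+1}, u_{j+1})."""
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        d[j][0] = d[j - 1][0] + u_sizes[j - 1]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            d[j][i] = min(d[j][i - 1],                           # skip v_i, free
                          d[j - 1][i] + u_sizes[j - 1],          # skip u_j
                          d[j - 1][i - 1] + dist[j - 1][i - 1])  # match v_i, u_j
    return d[n][m]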

We again use the example data of Table 4 to illustrate the subsequence matching. The computation of d[m, n] between the sequence V and the query segment U is shown in Table 5. It is assumed that the sizes of all the frames in the query U are 25. From the computation of d[i, j], we can easily trace back the optimal correspondence of the subsequence matching. In this example, there are two corresponding mappings, expressed as sets of triples (vi, uj, cost), producing an optimal match at distance 5:

(1) (v3, u1, 0), (v5, u2, 0), (v9, u3, 4), (v10, u4, 1), and (v11, u5, 0).

(2) (v3, u1, 0), (v5, u2, 0), (v10, u3, 2), (v11, u4, 2), and (v12, u5, 1).

This means that the optimal subsequence matching between V and U is obtained at t(V, U) = 5. The subsequence matching approach thus provides a valuable solution for video sequence matching. As with full-sequence matching, we can retrieve similar sequences from the sequences S1, S2, …, SN via a query segment T specifying some key frames (see Figure 9). We simply replace the full-sequence distance function W(Si, T) with the subsequence distance t(Si, T). The most similar sequence(s) to T is

    {Si | t(Si, T) is the minimum of t(Sk, T), 1 ≤ k ≤ N}.

Table 5. The computation of d[12, 5] between sequence V and a segment U

        V:   0    1    2    3    4    5    6    7    8    9   10   11   12
U  0         0    0    0    0    0    0    0    0    0    0    0    0    0
   1        25    4    2    0    0    0    0    0    0    0    0    0    0
   2        50   29    8    4    1    0    0    0    0    0    0    0    0
   3        75   54   33   28   20   17   12    8    6    4    2    1    0
   4       100   79   58   45   36   26   21   14    9    6    5    4    4
   5       125  104   83   70   61   48   35   27   18   11    7    5    5


Figure 9. Choosing the most similar sequence from S1, …, SN matching T

6. Approximate Sequence Matching

In general, similarity retrieval [20] is required, since users usually cannot express queries in a precise way. For a frame with abundant content, users often cannot give a precise description in a query, and objects existing in a reference frame may not be expressed in a rough query. This strategy provides a more convenient and user-friendly manner of querying in an interactive environment. The user can specify a query sequence consisting of some key frames concentrating on key objects. Only the key objects in the query sequence are considered and compared. In such a case, the superfluous objects in a reference frame can be ignored on purpose when they do not appear in the query frame. This means that the cost of deleting an object in a reference frame is weighted zero. Meanwhile, a relabel operation which replaces a symbol in a reference frame with an empty-node in a query frame is viewed as a special case of the delete operation. On the contrary, all the objects specified in a query are needed for measuring the distance. Therefore, the computation of the editing distance between a reference frame and a query frame is modified into a partial function. The partial tree matching algorithm for computing the partial tree distance is proposed in Hsu et al. [15]. Note that this algorithm also takes O(n1²n2²) time for two trees consisting of n1 and n2 nodes, respectively. We use c(reference, query) to represent the partial distance between a reference frame and a query frame. Similar to the definition of the tree distance d, the partial distance c is the product of the two partial tree distances along the two axis directions.

To demonstrate the computation of the partial distance, we use the example picture f2 in Figure 2(a), regarded as the reference frame, and the query sketch q2 in Figure 2(b), considered as the query frame. The 2D C-trees of f2 and q2 are shown in Figures 3 and 4, respectively. Because the editing operations needed to transform f2 into q2 are all delete operations (the relabel operation needed in the x-direction is actually a special case of the delete operation), the partial tree distances between them are zero along both axis directions. That is, the partial distance between f2 and q2, c(f2, q2), is zero. In such a case, the reference f2 matches the query q2 with zero cost; in other words, the query q2 is a partial description of the reference f2 without any difference.

Now we can employ this similarity strategy via the partial distance in the matching algorithms discussed in Section 5. The example data in Table 6 are constructed to demonstrate the computation of partial distances between a reference sequence V and a query sequence U.

For the three matching schemes discussed in Section 4, the proposed matching algorithms can be modified and extended for approximate sequence matching by replacing d(vi, uj) with the partial distance c(vi, uj). We use Algorithms 2 and 4 to illustrate the approximate sequence matching.

The segment matching algorithm can select the similar segment(s) in V under a threshold; we simply replace d(vi, uj) with c(vi, uj) in Algorithm 2. Only two segments of V match under θ = 3, namely v6v7v8v9v10 and v7v8v9v10v11. The segment v6v7v8v9v10 has the minimum-cost correspondence for the query U.
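As a check, running the segment-matching sketch from Section 5.2 on the partial distances of Table 6, bound to partial below, with θ = 3 reproduces exactly these two start positions.

# Partial distances c(v_i, u_j) from Table 6 (rows u1..u5, columns v1..v12).
partial = [[2, 1, 0, 1, 1, 1, 2, 3, 3, 4, 4, 4],    # c(v_i, u_1)
           [3, 2, 1, 1, 0, 1, 2, 2, 2, 3, 3, 4],    # c(v_i, u_2)
           [4, 4, 4, 4, 4, 3, 3, 2, 2, 1, 1, 0],    # c(v_i, u_3)
           [4, 4, 3, 2, 2, 2, 2, 1, 0, 1, 1, 2],    # c(v_i, u_4)
           [4, 4, 4, 4, 3, 2, 2, 2, 1, 1, 0, 1]]    # c(v_i, u_5)

# Only the segments starting at v6 and v7 match under theta = 3.
assert segment_matches(partial, m=12, n=5, theta=3) == [6, 7]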

The subsequence matching algorithm is modified similarly: the distance computation function d(vi, uj) is replaced with the partial distance c(vi, uj). We can then easily obtain an approximate subsequence of the reference sequence V matching U. Only one matching correspondence, {(v3, u1, 0), (v5, u2, 0), (v8, u3, 2), (v9, u4, 0), (v11, u5, 0)}, is selected under a constraint threshold of 4, representing the size of each query frame. The computation of the approximate sequence matching is shown in Table 7, and the optimal matching is obtained at a cost equal to 2.

Therefore, approximate sequence matching can provide a comprehensive retrieval of video data. The general mechanism for retrieving the most similar sequence(s) from the N video sequences in the database, S1, S2, …, SN, via a query sequence T is performed through the following three steps.

Table 6. The partial distances between a source stream V and a sequence U

c(vi, uj)   v1   v2   v3   v4   v5   v6   v7   v8   v9   v10  v11  v12
u1           2    1    0    1    1    1    2    3    3    4    4    4
u2           3    2    1    1    0    1    2    2    2    3    3    4
u3           4    4    4    4    4    3    3    2    2    1    1    0
u4           4    4    3    2    2    2    2    1    0    1    1    2
u5           4    4    4    4    3    2    2    2    1    1    0    1

Table 7. The computation of d[12, 5] between sequence V and a segment U

        V:   0    1    2    3    4    5    6    7    8    9   10   11   12
U  0         0    0    0    0    0    0    0    0    0    0    0    0    0
   1         4    2    1    0    0    0    0    0    0    0    0    0    0
   2         8    6    4    2    1    0    0    0    0    0    0    0    0
   3        12   10    8    6    5    4    3    3    2    2    1    1    0
   4        16   14   12   10    8    7    6    5    4    2    2    2    2
   5        20   18   16   14   12   11    9    8    7    5    3    2    2

Step 1: Compute all the partial distances of the frame pairs between a sequence Sk and T, c(ski, tj), 1 ≤ i ≤ |Sk|, 1 ≤ j ≤ |T|.

Step 2: Compute the minimum partial subsequence matching for Sk and T, t(Sk, T).

Step 3: Repeat Steps 1 and 2 for all sequences Sk, 1 ≤ k ≤ N. Then choose the one(s) with minimum cost as the most similar matched video sequence(s) for T.

In summary, for a query sequence T consisting of n frames, computing the minimum partial distance against a reference sequence of m frames takes O(mn) time; this comparison is repeated for the N reference sequences, and the sequence(s) with minimum cost in the database are chosen as the most similar.
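Feeding the same partial-distance table into the subsequence sketch given after Algorithm 4, with every query frame size equal to 4, reproduces the optimal cost from Table 7.

# Approximate subsequence matching over the Table 6 data: every query
# frame has size 4, and the optimal cost is 2, the bottom-right cell
# of Table 7.
assert subsequence_distance(partial, m=12, n=5, u_sizes=[4] * 5) == 2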

7. Prototype System

We use the above mechanisms to implement an Interactive Video Information System (IVIS) in our experimental project. The schematic diagram of the IVIS is shown in Figure 10.

There are three subsystems in the IVIS. The query subsystem provides users with an interactive interface for querying and browsing the retrieved results. The system supports single-frame queries and frame sequence queries. The interface allows users to draw a query image by assembling the designed object icons, or to use an existing query template consisting of query frames. The subsystem contains a query processing module and an interactive browser module. To achieve content-based retrieval, a friendly user interface is required, with which users can refine queries and navigate their way through the database visually. The information management subsystem is the core of the IVIS. There are four modules in this subsystem: the content acquisition module, the video indexing module, the video store/retrieval module and the index matching module. When we collect the source video streams, the content acquisition module processes the raw data and identifies some key image frames. The representative objects in the key frames are also extracted. These tasks are not easy to achieve automatically with current techniques in pattern recognition and image understanding, and are instead performed in a human-assisted fashion. After extracting the representative objects from the identified key frames, the video indexing module constructs the 2D C-trees that serve as the representative indexing of the key frames. All the video streams and key frames are stored in the data storage subsystem by the video store/retrieval module.


Figure 10. Schematic diagram of an interactive video information system

The index matching module computes the minimum matching distance for the query processing subsystem, based upon the proposed approximate sequence matching mechanism.

There are three major databases in the data storage subsystem. The constructed 2D C-trees are the representative indexing of the video streams and are kept in the index store. All the source video data in the AVI file format are compressed into compressed video streams to reduce the storage space. The identified key-frame data are also collected in the database, providing annotations with indices to speed up retrieval of the desired video streams.

In the experimental project, we captured 48 streams from 'The Lion King' cartoon produced by The Walt Disney Company. Each stream lasts 99 s and consists of about 1500 frames. In all, 351 key frames have been identified, some of which are listed in Appendix A. Each stream has 7 or 8 key frames on average. For this popular animation, 78 roles are chosen to be the objects, which are extracted from the key frames in a human-assisted fashion. These objects are classified into 28 categories, represented by a set of designed icons in the system. The labels of the objects are listed in Appendix B. The objects and their bounding rectangles within images are also extracted after capturing the images from the source video.


Figure 11. An example template query for approximate sequence matching

Each frame, containing about five objects on average, is constructed into two 2D C-trees along the x- and y-axis directions independently, and these two indexing representations are stored. The number of objects within each frame implies the number of nodes in its corresponding 2D C-trees. In the IVIS, we concentrate on video information retrieval based upon the approximate sequence matching mechanism. We allow users to draw a sequence of query frames by assembling the designed object icons, or to use an existing query template consisting of some consecutive query frames. An example is shown in Figure 11. Before computing the distances among frames, the two 2D C-trees of each frame in the query sequence are constructed. Starting from the first reference sequence of the video database, we compute all the partial distances between the frame pairs of this reference sequence and the query sequence. The subsequence matching distance between this reference sequence and the query sequence is then easily computed by Algorithm 4. When all the sequences of the database have been compared with the query one by one, the minimum one represents the most similar sequence for the query. The result of a query via the template in Figure 11 is shown in Figure 12. After retrieving a stream, the user can manipulate the stream via basic operations such as Play, Reverse, Pause, Goto, Backward, Forward and so on.

For the 48 video streams, we have used each representative key frame sequence as a test query sequence against the others. Each comparison takes 15 to 25 ms on a Pentium-133 CPU with 64 MB RAM. All the video streams are exactly retrieved from the database with distance zero. Our initial results validate the effectiveness of similarity retrieval by 2D C-tree matching.

8. Conclusions

Chang et al. proposed an effective 2D string approach for spatial indexing of image data. To represent spatio-temporal information, we have extended this approach of iconic image indexing to video data indexing in this paper.


Figure 12. A result stream via a query template in Figure 11

We investigate a new knowledge model, the 2D C-tree, to characterize the spatial information of video content. The indexing technique represents the spatial content within individual frames and constructs an ordering set of 2D C-trees for a video sequence. Each frame is constructed into two representative 2D C-trees along the x- and y-axes. The ordering set of 2D C-trees then becomes the representative index of the video sequence. The similarity between two frames is measured by the editing distance between their corresponding 2D C-trees. The video sequence matching problem is solved by computing the minimum cost of corresponding frames. For various applications or demands, we propose three matching schemes for video information retrieval: full-sequence matching, segment matching and subsequence matching. We also extend the matching schemes to approximate sequence matching by computing the partial distances between reference frames and query frames. The approximate sequence matching provides a comprehensive operation for retrieving video data. We also briefly present the prototype we developed based on the proposed video information retrieval strategy.

This work concentrates on handling the spatio-temporal relationships of video data. Nevertheless, how to integrate other useful features [21], such as color, shape, texture and even voice, into video retrieval is worth exploring for many practical applications, such as news databases, electronic shopping and distance learning. Moreover, automatic content extraction, especially object recognition or identification in video frames, is also among our future research directions.


Appendix A


Appendix B

Seventy-eight roles are chosen to be the objects in 'The Lion King' cartoon

Category Object

Lion Mufasa, Simba~cub, Simba~adult, Scar, Simba~son, Sarabi, Nana~cub, Nana~adult, Sanafina, Lioness, Lionesses

Baboon Rafiki, Baboon1, Baboons

Hornbill Zazu

Hyena Banzai, Shenzi, Ed, Hyena1, Hyenas

Warthog Pumbaa

Meerkat Timon

Animal Hippo, Elephant, Rhino, Giraffe, Zebra, Antelope, Leopard, Ape, Vulture, Wildebeest, Squirrel, Bird, Crocodile, Mouse

Animals Hippos, Elephants, Rhinos, Giraffes, Zebras, Antelopes, Leopards, Apes, Vultures, Wildebeests, Squirrels, Birds, Crocodiles

Insect Ant, Butterfly, Worm

Land Pride~land, Cracked~land

Rock Pride~rock, Rock

Grass Dark~grass, Bright~grass

Sun Sun

Valley Valley

Tree Tree1, Trees, Forest

Sky Dark~sky, With~stars, Bright~sky

Termite~mt   Termite~mt
Gorge        Gorge
Thornbushes  Thornbushes
Flame        Flame
Jungle       Jungle
Pond         Pond
Waterfall    Waterfall
Stream       Stream
Mount        Mount
Flowers      Flowers
Skull        Skull
Cub~painting Cub~painting

References

1. K. Shearer, S. Venkatesh & D. Kieronska (1996) Spatial indexing for video databases. Journal of Visual Communication and Image Representation 7, 325–335.

2. E. Oomoto & K. Tanaka (1993) OVID: design and implementation of a video-object database system. IEEE Transactions on Knowledge and Data Engineering 5, 629–643.

3. S. W. Smoliar & H. Zhang (1994) Content-based video indexing and retrieval. IEEE Multimedia 1, 62–72.


3, 25–41.

8. A. Gupta, T. Weymouth & R. Jain (1991) Semantic queries with pictures: the VIMSYS model. In: Proceedings of the 17th International Conference on Very Large Data Bases, Barcelona, pp. 69–79.

9. S. K. Chang, Q. Y. Shi & C. W. Yan (1987) Iconic indexing by 2-D strings. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9, 413–428.

10. S. Y. Lee, M. K. Shan & W. P. Yang (1989) Similarity retrieval of iconic image database. Pattern Recognition 22, 675–682.

11. F. J. Hsu, S. Y. Lee & P. S. Lin (1997) 2D C-tree spatial representation for iconic image. In: Proceedings of the 2nd International Conference on Visual Information Systems (Visual '97), San Diego, CA, pp. 287–294.

12. S. Y. Lee & F. J. Hsu (1990) 2D C-string: a new spatial knowledge representation for image database systems. Pattern Recognition 23, 1077–1087.

13. S. Y. Lee & F. J. Hsu (1992) Spatial reasoning and similarity retrieval of images using 2D C-string knowledge representation. Pattern Recognition 25, 305–318.

14. K. S. Fu (1982) Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ.

15. F. J. Hsu, S. Y. Lee & P. S. Lin (1998) Similarity retrieval by 2D C-trees matching in image databases. Journal of Visual Communication and Image Representation 9, 87–100.

16. K. Zhang (1995) Algorithms for the constrained editing distance between ordered labeled trees and related problems. Pattern Recognition 28, 465–474.

17. U. Manber (1989) Introduction to Algorithms: A Creative Approach. Addison-Wesley, Reading, MA.

18. Y. P. Wang & T. Pavlidis (1990) Optimal correspondence of string subsequences. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-12, 1080–1087.

19. D. E. Knuth, J. H. Morris & V. R. Pratt (1977) Fast pattern matching in strings. SIAM Journal on Computing 6, 323–350.

20. P. Ciaccia, F. Rabitti & P. Zezula (1996) Similarity search in multimedia database systems. In: Proceedings of the 1st International Conference on Visual Information Systems (Visual '96), Melbourne, Australia, pp. 107–115.

21. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele & P. Yanker (1995) Query by image and video content: the QBIC system. IEEE Computer 28, 23–32.
