Similarity Retrieval of Video Content in Video Information Systems

National Science Council, Executive Yuan: Research Project Report

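Project No.: NSC 88-2213-E-004-006

Project period: October 1, 1998 to July 31, 1999

Principal investigator: M. K. Shan (沈錳坤), Department of Computer Science, National Chengchi University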

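Chinese Abstract

The main characteristics of video information retrieval are similarity retrieval and content-based retrieval. This project studies the measurement of similarity between video contents. We propose a series of measures based on the similarity of video frame sequences, along with their corresponding algorithms. A video frame sequence takes the temporal order of the frames into account. The project also evaluates the effectiveness of the proposed similarity measures.

Keywords: video information systems, content-based video retrieval, similarity retrieval

1. Abstract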

The distinguishing features of video retrieval are similarity-based retrieval and content-based retrieval.

In this project, similarity measures for video content are investigated. We propose a series of similarity measures based on the similarity of frame sequences, which take temporal ordering into consideration. The corresponding algorithms are also presented. The effectiveness of the proposed measures, evaluated by precision and recall, is also reported.

Keywords: Video Information Systems, Content-Based Video Retrieval, Similarity Retrieval

2. Motivations

Video access is one of the important design issues in the development of multimedia information systems, video-on-demand, and digital libraries. Video can be accessed by attributes, as in traditional database techniques; by semantic descriptions, as in traditional information retrieval techniques; by visual features; and by browsing.

To support video access by visual features and browsing, the structure and content of the video must be analyzed so that it can be indexed and accessed. After video parsing, a sequence of key frames is extracted from each segmented video shot and serves as a representative set of images for that shot.

Each video shot V is associated with a sequence of visual features (v_1, v_2, …, v_N), where N is the number of key frames and v_j, 1 ≤ j ≤ N, is an f-dimensional vector of visual feature values. Given two video shots U and V, and assuming that the distance d(u_i, v_j) is available, ∀ i, 1 ≤ i ≤ M, ∀ j, 1 ≤ j ≤ N, the goal is to define the similarity (or dissimilarity) between these video shots in consideration of temporal features.
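As an illustration of the setting just described, the following minimal sketch builds the M x N table of frame distances that the similarity measures in this report take as input. The report leaves d unspecified, so the city-block (L1) distance used here is an assumption, and the names l1_distance and distance_matrix are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one f-dimensional feature vector per (key) frame


def l1_distance(a: Frame, b: Frame) -> float:
    """City-block distance; an assumed choice, the report leaves d unspecified."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension f")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(u: Sequence[Frame], v: Sequence[Frame]) -> List[List[float]]:
    """Return d[i][j] = d(u_{i+1}, v_{j+1}) as an M x N table (0-based indices)."""
    return [[l1_distance(ui, vj) for vj in v] for ui in u]


# Toy shots with f = 2: U has M = 2 frames, V has N = 3 frames.
shot_u = [(0.5, 0.5), (0.0, 1.0)]
shot_v = [(0.5, 0.5), (0.25, 0.75), (0.0, 1.0)]
d = distance_matrix(shot_u, shot_v)
```

Precomputing the table once keeps the later dynamic programs independent of how frames are represented.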

In this project, we propose a series of shot similarity algorithms based on similarity of frame or key frame sequence. Since the algorithms apply well to both frame sequence and key frame sequence, we will use frame sequence to stand for frames or key frames in proper context in the rest of the report.

3. Similarity of Frame Sequences

Given two sequences of frames, how should the (dis)similarity between them be defined? People often judge the similarity between videos by their common subsequences. We present several similarity algorithms based on the similarity of frame sequences.

A similarity measure is symmetric if D(U, V) = D(V, U). The most straightforward symmetric measure is the one-to-one optimal mapping.

In the one-to-one mapping, we try to map as many pairs of frames as possible under the constraint that each frame u_i in U corresponds to at most one frame v_j in V.

Obviously, the maximal number of mapped pairs is equal to the number of frames in the shorter shot (the shot with fewer frames). The mapping with the minimum sum of frame distances is selected as the optimal mapping. The formal definitions are given as follows.

Definition 1 Given two video shots U = (u_1, u_2, …, u_M) and V = (v_1, v_2, …, v_N) and the distance d(u_i, v_j), ∀ i, 1 ≤ i ≤ M, ∀ j, 1 ≤ j ≤ N, a mapping between them is a one-to-one relation R_M from {1, 2, …, M} to {1, 2, …, N} such that

(1) |R_M| = min{M, N}, where |R_M| denotes the cardinality of R_M,

(2) for any two ordered pairs (i, j), (k, l) in R_M, (j < l) if and only if (i < k).

Definition 2 Given two video shots U = (u_1, u_2, …, u_M) and V = (v_1, v_2, …, v_N), the distance between U and V for a given mapping R_M, D_RM(U, V), is defined as

$$D_{R_M}(U, V) = \sum_{\forall (i, j) \in R_M} d(u_i, v_j).$$

Definition 3 The distance between U and V for Optimal Mapping (OM) is defined as

$$D_{OM}(U, V) = \min_{\forall R_M} \{ D_{R_M}(U, V) \}.$$
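To make Definitions 1 to 3 concrete, the sketch below evaluates D_OM by plain enumeration: it lists every order-preserving one-to-one mapping of size min{M, N} and keeps the cheapest. It is meant only as a reference for tiny inputs, not as the method proposed in this report, and the function name om_brute_force is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def om_brute_force(d: Sequence[Sequence[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Enumerate all order-preserving one-to-one mappings of size min(M, N).

    d[i][j] holds d(u_{i+1}, v_{j+1}); returned pairs are 1-based like the text.
    """
    m, n = len(d), len(d[0])
    k = min(m, n)
    best_cost, best_map = float("inf"), []
    for rows in combinations(range(m), k):      # which frames of U are mapped
        for cols in combinations(range(n), k):  # which frames of V are mapped
            # Both index tuples are increasing, so pairing them in order
            # satisfies condition (2) of Definition 1.
            cost = sum(d[i][j] for i, j in zip(rows, cols))
            if cost < best_cost:
                best_cost = cost
                best_map = [(i + 1, j + 1) for i, j in zip(rows, cols)]
    return best_cost, best_map


d = [[0.0, 0.5, 1.0],
     [1.0, 0.5, 0.0]]
print(om_brute_force(d))  # (0.0, [(1, 1), (2, 3)])
```

The number of candidate mappings grows combinatorially with the length of the longer shot, which is why a dynamic programming formulation is preferable in practice.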

The solution of D_OM(U, V) can be found by dynamic programming. Assume that the shorter shot is U and the longer one is V. Our goal is to find the subsequence of V that is most similar to U. Let D[m, n] be the minimum cost of mapping between (u_1, u_2, …, u_m) and (v_1, v_2, …, v_n). There are two possibilities:

map: the frame v_n is mapped with the frame u_m, D[m, n] = D[m-1, n-1] + d(u_m, v_n).

ignore: the frame v_n is not selected to be mapped with the frame u_m, D[m, n] = D[m, n-1].

Combining these two cases, we get the following recurrence relation for the solution of D_OM(U, V):

$$D[m, n] = \min \begin{cases} D[m-1, n-1] + d(u_m, v_n) \\ D[m, n-1] \end{cases}$$

with D[0, j] = 0 for all j, 0 ≤ j ≤ N, and D[i, 0] = ∞ for all i, 1 ≤ i ≤ M. Note that the relation R_M can be constructed by backtracking through the matrix D, starting from D[M, N].
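The recurrence and the backtracking step can be sketched as follows. This is a minimal illustration under the stated boundary conditions, assuming the distance table d is supplied as a list of rows with the shorter shot on the rows; the function name optimal_mapping is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def optimal_mapping(d: Sequence[Sequence[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """D_OM for an M x N table d with M <= N, plus one optimal mapping (1-based)."""
    m, n = len(d), len(d[0])
    if m > n:
        raise ValueError("put the shorter shot on the rows (M <= N)")
    # Row 0 is all zeros (including D[0][0]); column 0 is infinite for i >= 1.
    D = [[0.0] * (n + 1)] + [[math.inf] * (n + 1) for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mapped = D[i - 1][j - 1] + d[i - 1][j - 1]  # map u_i with v_j
            ignored = D[i][j - 1]                       # ignore v_j
            D[i][j] = min(mapped, ignored)
    # Backtrack from D[M][N] to recover the pairs of R_M.
    pairs, i, j = [], m, n
    while i > 0:
        if D[i][j] == D[i][j - 1]:
            j -= 1                     # v_j was ignored
        else:
            pairs.append((i, j))       # u_i was mapped with v_j
            i, j = i - 1, j - 1
    return D[m][n], pairs[::-1]


cost, mapping = optimal_mapping([[0.0, 0.5, 1.0],
                                 [1.0, 0.5, 0.0]])
```

The table has (M+1)(N+1) entries and each is filled in constant time, so the sketch runs in O(MN) time and space.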

Optimal mapping is a one-to-one frame mapping. However, key frames are sometimes extracted by uniform sampling, in which case two consecutive key frames in the extracted sequence are likely to be similar. In addition, two sequences of key frames are sometimes extracted by non-uniform sampling but with different thresholds. Given two similar shots, more key frames are then extracted for the shot with the lower threshold. It is therefore necessary to measure sequence similarity based on many-to-many frame mapping.

Definition 4 Given two video shots U = (u_1, u_2, …, u_M) and V = (v_1, v_2, …, v_N), a mapping with replication is a many-to-many relation R_MR from {1, 2, …, M} to {1, 2, …, N} such that

(1) for each i, 1 ≤ i ≤ M, there exists at least one j, 1 ≤ j ≤ N, such that (i, j) ∈ R_MR,

(2) for each j, 1 ≤ j ≤ N, there exists at least one i, 1 ≤ i ≤ M, such that (i, j) ∈ R_MR,

(3) for any two ordered pairs (i, j), (k, l) in R_MR, (j ≤ l) if and only if (i ≤ k).

Definition 5 Given two video shots U = (u_1, u_2, …, u_M) and V = (v_1, v_2, …, v_N), the distance between U and V for a given mapping R_MR, D_RMR(U, V), is defined as

$$D_{R_{MR}}(U, V) = \sum_{\forall (i, j) \in R_{MR}} d(u_i, v_j).$$

Definition 6 Given two video shots U = (u_1, u_2, …, u_M) and V = (v_1, v_2, …, v_N), and d(u_i, v_j), ∀ i, 1 ≤ i ≤ M, ∀ j, 1 ≤ j ≤ N, the distance between U and V for Optimal Mapping with Replication (OMR) is defined as

$$D_{OMR}(U, V) = \min_{\forall R_{MR}} \{ D_{R_{MR}}(U, V) \}.$$

Similar to D_OM(U, V), the solution of D_OMR(U, V) can be found by dynamic programming. There are three possible relations between D[m, n] and D[i, j] for smaller i and j:

map: the frame v_n is mapped with the frame u_m, D[m, n] = D[m-1, n-1] + d(u_m, v_n).

replicate v_n: the frame v_n is replicated to be mapped with the frame u_m, D[m, n] = D[m-1, n] + d(u_m, v_n).

replicate u_m: the frame u_m is replicated to be mapped with the frame v_n, D[m, n] = D[m, n-1] + d(u_m, v_n).

Combining these three cases, we get the following recurrence relation for the solution of D_OMR(U, V):

$$D[m, n] = \min \begin{cases} D[m-1, n-1] + d(u_m, v_n) \\ D[m-1, n] + d(u_m, v_n) \\ D[m, n-1] + d(u_m, v_n) \end{cases}$$

with D[0, 0] = 0, D[0, j] = ∞ for all j, 1 ≤ j ≤ N, and D[i, 0] = ∞ for all i, 1 ≤ i ≤ M.

A similarity measure is asymmetric if D(U, V) ≠ D(V, U). In general, asymmetric similarity measures are used to map a query sequence of frames to a video sequence of frames. The simplest proposed asymmetric similarity measure is the Optimal Subsequence Mapping (OSM). The algorithm for OSM is similar to that for OM, except that the query video sequence must be the shorter sequence.
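Returning to the symmetric OMR measure, its three-case recurrence can be sketched as below. The recurrence has the same form as classic dynamic time warping. The function name omr_distance is hypothetical, and the distance table d is assumed to be given as a list of rows.

```python
import math
from typing import Sequence


def omr_distance(d: Sequence[Sequence[float]]) -> float:
    """D_OMR for an M x N table d, where d[i][j] = d(u_{i+1}, v_{j+1})."""
    m, n = len(d), len(d[0])
    # D[0][0] = 0; the rest of row 0 and column 0 stays infinite.
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best_prev = min(D[i - 1][j - 1],  # map
                            D[i - 1][j],      # replicate v_n
                            D[i][j - 1])      # replicate u_m
            D[i][j] = best_prev + d[i - 1][j - 1]
    return D[m][n]


d = [[0.0, 0.5, 1.0],
     [1.0, 0.5, 0.0]]
d_transposed = [list(column) for column in zip(*d)]
print(omr_distance(d), omr_distance(d_transposed))  # 0.5 0.5
```

Swapping the roles of the two shots transposes the table and leaves the result unchanged, which matches the symmetry of OMR.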

Analogous to the symmetric measure OMR, we define the asymmetric measure OSMR as follows.

Definition 7 Given the query shot Q = (q_1, q_2, …, q_M) and the video shot V = (v_1, v_2, …, v_N), M ≤ N, a subsequence mapping with replication is a one-to-many relation R_SMR from {1, 2, …, M} to {1, 2, …, N} such that

(1) for each i, 1 ≤ i ≤ M, there exists at least one j, 1 ≤ j ≤ N, such that (i, j) ∈ R_SMR,

(2) for each j, 1 ≤ j ≤ N, there exists one i, 1 ≤ i ≤ M, such that (i, j) ∈ R_SMR,

(3) for any two ordered pairs (i, j), (k, l) in R_SMR, (j < l) if and only if (i ≤ k).

Definition 8 Given the query shot Q = (q_1, q_2, …, q_M), the video shot V = (v_1, v_2, …, v_N), and the distance d(q_i, v_j), ∀ i, 1 ≤ i ≤ M, ∀ j, 1 ≤ j ≤ N, the distance between Q and V for a given mapping R_SMR, D_RSMR(Q, V), is defined as

$$D_{R_{SMR}}(Q, V) = \sum_{\forall (i, j) \in R_{SMR}} d(q_i, v_j).$$

Definition 9 Given the query shot Q = (q_1, q_2, …, q_M) and the video shot V = (v_1, v_2, …, v_N), the distance between Q and V for Optimal Subsequence Mapping with Replication (OSMR) is defined as

$$D_{OSMR}(Q, V) = \min_{\forall R_{SMR}} \{ D_{R_{SMR}}(Q, V) \}.$$

Sometimes the mapped frames are required to be consecutive, which places a stricter constraint on the mapping relation.

Definition 10 Given the query shot Q = (q_1, q_2, …, q_M) and the video shot V = (v_1, v_2, …, v_N), M ≤ N, a consecutive mapping is a one-to-one relation R_CM from {1, 2, …, M} to {1, 2, …, N} such that

(1) for each i, 1 ≤ i ≤ M, there exists one j, 1 ≤ j ≤ N, such that (i, j) ∈ R_CM,

(2) for any two ordered pairs (i, j), (k, l) in R_CM, [(j - l) = 1] if and only if [(i - k) = 1].

Definition 11 Given the query shot Q = (q_1, q_2, …, q_M), the video shot V = (v_1, v_2, …, v_N), and d(q_i, v_j), ∀ i, 1 ≤ i ≤ M, ∀ j, 1 ≤ j ≤ N, the distance between Q and V for a given consecutive mapping R_CM, D_RCM(Q, V), is defined as

$$D_{R_{CM}}(Q, V) = \sum_{\forall (i, j) \in R_{CM}} d(q_i, v_j).$$

Definition 12 Given the query shot Q = (q_1, q_2, …, q_M) and the video shot V = (v_1, v_2, …, v_N), the distance between Q and V for Optimal Consecutive Mapping (OCM) is defined as

$$D_{OCM}(Q, V) = \min_{\forall R_{CM}} \{ D_{R_{CM}}(Q, V) \}.$$

Algorithm Optimal Consecutive Mapping (OCM)

    for j = 0 to N-M do D[0, j] = 0;
    for j = 0 to N-M do
        for i = 1 to M-1 do
            D[i, i+j] = D[i-1, i+j-1] + d(q_i, v_{i+j});
    D[M, M] = D[M-1, M-1] + d(q_M, v_M);
    for j = M+1 to N do
        D[M, j] = min(D[M-1, j-1] + d(q_M, v_j), D[M, j-1]);
    return D[M, N]
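A runnable sketch of the OCM algorithm follows, with loop bounds chosen for M ≤ N as required by Definition 10. It is paired with a direct evaluation of Definition 12 (the cheapest of the N-M+1 aligned windows) as a cross-check. Both function names are hypothetical, and the distance table d is assumed to be given as a list of rows with the query on the rows.

```python
import math
from typing import Sequence


def ocm_distance(d: Sequence[Sequence[float]]) -> float:
    """D_OCM for an M x N table d (query on the rows, M <= N)."""
    m, n = len(d), len(d[0])
    if m > n:
        raise ValueError("the query must not be longer than the video shot")
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    for j in range(n - m + 1):            # any start offset costs nothing
        D[0][j] = 0.0
    for j in range(n - m + 1):            # follow each diagonal up to row M-1
        for i in range(1, m):
            D[i][i + j] = D[i - 1][i + j - 1] + d[i - 1][i + j - 1]
    # Last row: keep the cheapest diagonal seen so far.
    D[m][m] = D[m - 1][m - 1] + d[m - 1][m - 1]
    for j in range(m + 1, n + 1):
        D[m][j] = min(D[m - 1][j - 1] + d[m - 1][j - 1], D[m][j - 1])
    return D[m][n]


def ocm_by_window(d: Sequence[Sequence[float]]) -> float:
    """Same quantity computed directly: the best of the N-M+1 aligned windows."""
    m, n = len(d), len(d[0])
    return min(sum(d[i][i + s] for i in range(m)) for s in range(n - m + 1))


d = [[1.0, 0.0, 5.0],
     [5.0, 1.0, 0.0]]
print(ocm_distance(d), ocm_by_window(d))  # 0.0 0.0
```

Because a consecutive one-to-one mapping is fully determined by its starting offset, both functions examine the same N-M+1 candidates.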

Next, we extend the definition of Optimal Consecutive Mapping to Optimal Consecutive Mapping with Replication (OCMR). In OCMR, each frame of the query sequence is permitted to map to more than one frame of the video sequence. As in OCM, the mapped frames of the video sequence must be consecutive.

Definition 13 Given the query video shot Q = (q_1, q_2, …, q_M) and the video shot V = (v_1, v_2, …, v_N), M ≤ N, a consecutive mapping with replication is a one-to-many relation R_CMR from {1, 2, …, M} to {1, 2, …, N} such that

(1) for each i, 1 ≤ i ≤ M, there exists at least one j, 1 ≤ j ≤ N, such that (i, j) ∈ R_CMR,

(2) let pmax(i) be max{ j | (i, j) ∈ R_CMR } and pmin(i) be min{ j | (i, j) ∈ R_CMR }; for each j, pmin(1) ≤ j ≤ pmax(M), there exists one i, 1 ≤ i ≤ M, such that (i, j) ∈ R_CMR,

(3) for any two ordered pairs (i, j), (k, l) in R_CMR, if (i < k) then (j < l),

(4) for any two ordered pairs (i, j), (k, l) in R_CMR, if [(j - l) = 1] then either [(i - k) = 1] or [(i - k) = 0].

Definition 14 Given the query video shot Q = (q_1, q_2, …, q_M), the video shot V = (v_1, v_2, …, v_N), and d(q_i, v_j), ∀ i, 1 ≤ i ≤ M, ∀ j, 1 ≤ j ≤ N, the distance between Q and V for a given consecutive mapping with replication R_CMR, D_RCMR(Q, V), is defined as

$$D_{R_{CMR}}(Q, V) = \sum_{\forall (i, j) \in R_{CMR}} d(q_i, v_j).$$

Definition 15 Given the query shot Q = (q_1, q_2, …, q_M) and the video shot V = (v_1, v_2, …, v_N), the distance between Q and V for Optimal Consecutive Mapping with Replication (OCMR) is defined as

$$D_{OCMR}(Q, V) = \min_{\forall R_{CMR}} \{ D_{R_{CMR}}(Q, V) \}.$$

The solution of OCMR can be obtained as follows. First, if a mapping R is an OCMR, then the first query frame q_1 must be mapped with only one frame of the video sequence, and so must the last query frame q_M. Otherwise, suppose that q_1 is mapped with frames v_a, …, v_{i-1}, v_i and that q_M is mapped with v_j, v_{j+1}, …, v_b. We can then derive a mapping whose distance is no larger by removing the mapping pairs between q_1 and v_a, …, v_{i-1}, and those between q_M and v_{j+1}, …, v_b. Therefore q_1 and q_M behave like frames in OCM, and q_2, q_3, …, q_{M-1} behave like frames in OSMR.
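The report does not spell out a recurrence for OCMR, so the following is only one possible sketch derived from Definition 13 and the observation above: q_1 and q_M each take exactly one video frame, interior query frames may be replicated over a run of consecutive video frames, and no video frame inside the matched window is skipped. The function name ocmr_distance and the rolling-row layout are assumptions made for illustration.

```python
import math
from typing import Sequence


def ocmr_distance(d: Sequence[Sequence[float]]) -> float:
    """Sketch of D_OCMR for an M x N table d (query on the rows, M <= N)."""
    m, n = len(d), len(d[0])
    if m > n:
        raise ValueError("the query must not be longer than the video shot")
    if m == 1:
        return min(d[0])                  # q_1 alone picks its closest frame
    # prev[j] = best cost with the current query frame ending at v_{j+1}.
    prev = list(d[0])                     # q_1 maps to exactly one frame, any start
    for i in range(1, m):
        cur = [math.inf] * n
        last = (i == m - 1)               # q_M may not be replicated
        for j in range(1, n):
            best = prev[j - 1]            # advance to the next query frame
            if not last:
                best = min(best, cur[j - 1])  # replicate an interior query frame
            cur[j] = best + d[i][j]
        prev = cur
    return min(prev)                      # the window may end at any video frame


d = [[0.0, 9.0, 9.0, 9.0],
     [9.0, 1.0, 1.0, 9.0],
     [9.0, 9.0, 9.0, 0.0]]
print(ocmr_distance(d))  # 2.0
```

In the example, q_2 is replicated over v_2 and v_3, which gives a total of 2.0, whereas either strictly one-to-one consecutive window would cost 10.0.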

4. Experiments

To evaluate the performance of the proposed similarity measures, we use a database of 100 video shots. These shots are MPEG-II video digitized from CTS news, and their lengths range from 32 to 205 seconds. Five sample query videos are selected from the database. For each video, a sequence of key frames is extracted: each video clip is decompressed first, and the key frames are then extracted by non-uniform key frame extraction. Each key frame is represented as a 64-bin color histogram in HSV color space. We measure performance by precision and recall. The ground truth that determines which videos are relevant is judged by humans. Using the proposed similarity measures OM and OMR, each sample query returns a ranked list of candidate videos, from which the precision-recall values are calculated.
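As a reminder of how precision and recall are obtained from a ranked result list, the following sketch computes both values after each rank. The shot identifiers and relevance judgments are made up for illustration and are not data from the experiment; the function name precision_recall_at_ranks is hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall_at_ranks(ranked: Sequence[str],
                              relevant: Set[str]) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each rank k = 1..len(ranked)."""
    if not relevant:
        raise ValueError("the query needs at least one relevant shot")
    points, hits = [], 0
    for k, shot_id in enumerate(ranked, start=1):
        if shot_id in relevant:
            hits += 1
        # precision = hits among the top k; recall = hits among all relevant
        points.append((hits / k, hits / len(relevant)))
    return points


# Hypothetical shot identifiers, ordered by increasing distance to the query.
ranked = ["s07", "s12", "s03", "s40"]
relevant = {"s07", "s03"}
points = precision_recall_at_ranks(ranked, relevant)
```

Averaging such points over several queries at fixed recall levels yields an average precision-recall curve.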

Figure 1. The average precision-recall curve for similarity measure OM (plot title: "Average precision vs. recall (OM)"; axes: recall and precision, each from 0 to 1).

Figure 2. The average precision-recall curve for similarity measure OMR (plot title: "Average precision vs. recall (OMR)"; axes: recall and precision, each from 0 to 1).

Figures 1 and 2 show the average precision-recall curves for similarity measures OM and OMR, respectively. The results show that both measures perform well, especially when only one video is returned for the query video. In this case the precision is one; that is, both measures return the most similar video in the first rank. Moreover, OM performs better than OMR. This can be explained as follows. OM is a one-to-one mapping that leaves out unmatched frames, while OMR is a many-to-many mapping in which all frames are considered. From the point of view of human perception, however, two video shots are similar only if there are similar frames between them. OM therefore behaves more like human perception.

5. Conclusions

In this project, we have proposed a series of video similarity measures based on the similarity of frame sequences. The similarity algorithms, based on dynamic programming, are also presented. The experimental results show that similarity measure OM performs better than OMR.

In fact, performance depends heavily on how the video content is extracted. We plan to measure performance while taking the effect of the content extraction process into account. In addition, the proposed similarity measures permit two dissimilar frames to be mapped. Similarity measures that impose a dissimilarity constraint remain to be investigated.

6. Contributions

[1] M. K. Shan and S. Y. Lee, Content-Based Similarity Measures for Video Based on Similarity of Frame Sequence, IEEE IW-MMDBMS’98 International Workshop on Multimedia Data Base Management Systems, Dayton, Ohio, 1998.

[2] M. K. Shan and S. Y. Lee, A Generic Framework for Similarity Measures of Content-Based Video Retrieval, submitted to Pattern Recognition Letters, 1999.

