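National Science Council, Executive Yuan: Research Project Final Report
Similarity Retrieval of Video Content in Video Information Systems
Project No.: NSC 88-2213-E-004-006
Project period: October 1, 1998 to July 31, 1999
Principal investigator: M. K. Shan (沈錳坤), Department of Computer Science, National Chengchi University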
1. Abstract
The distinguishing features of video retrieval are similarity retrieval and content-based retrieval. In this project, we investigate similarity measures for video content. We propose a series of similarity measures based on the similarity of frame sequences, which take the temporal ordering of frames into account, and present the corresponding algorithms. We also evaluate the effectiveness of the proposed measures in terms of precision and recall.
Keywords: Video Information Systems, Content-Based Video Retrieval, Similarity Retrieval
2. Motivations
Video access is one of the important design issues in the development of multimedia information systems, video-on-demand, and digital libraries. Video can be accessed by attributes, using traditional database techniques; by semantic descriptions, using traditional information retrieval techniques; by visual features; and by browsing.
To support video access by visual features and by browsing, the structure and content of the video must be analyzed so that the video can be indexed and accessed. After video parsing, a sequence of key frames is extracted from each segmented video shot. This sequence of key frames serves as a representative set of images for the shot.
Each video shot V is associated with a sequence of visual features (v1, v2, …, vN), where N is the number of key frames and vj, 1≦j≦N, is an f-dimensional vector of visual feature values. Given two video shots U = (u1, u2, …, uM) and V = (v1, v2, …, vN), and assuming that the distance d(ui, vj) is available for all i, 1≦i≦M, and all j, 1≦j≦N, the goal is to define the similarity (or dissimilarity) between the two shots while taking temporal features into account.
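As an illustration of the input assumed by the measures that follow, the sketch below builds the table of frame distances d(ui, vj) from two sequences of feature vectors. The report does not fix a particular frame distance, so the Euclidean distance used here, and the names frame_distance and distance_matrix, are assumptions made for the example only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def frame_distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two f-dimensional feature vectors.

    The choice of Euclidean distance is an assumption of this sketch.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension f")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(U: Sequence[Vector], V: Sequence[Vector]) -> List[List[float]]:
    """Return the M x N table whose entry [i][j] is d(u_{i+1}, v_{j+1}) (0-based storage)."""
    return [[frame_distance(u, v) for v in V] for u in U]


# Two toy shots with 2-dimensional features (hypothetical values).
U = [(0.0, 0.0), (3.0, 4.0)]
V = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
d = distance_matrix(U, V)
print(d)  # [[0.0, 3.0, 5.0], [5.0, 4.0, 0.0]]
```

Any other frame distance, for example a histogram distance, could be substituted for frame_distance without changing the rest of the pipeline, since the measures only consume the table.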
In this project, we propose a series of shot similarity algorithms based on similarity of frame or key frame sequence. Since the algorithms apply well to both frame sequence and key frame sequence, we will use frame sequence to stand for frames or key frames in proper context in the rest of the report.
3. Similarity of Frame Sequences
Given two sequences of frames, how should the (dis)similarity between them be defined? People often judge the similarity between videos by their common subsequences. We present several similarity algorithms based on the similarity of frame sequences.
A similarity measure is symmetric if D(U, V) = D(V, U). The most straightforward symmetric measure is the one-to-one optimal mapping. In a one-to-one mapping, we map as many pairs of frames as possible under the constraint that each frame ui in U corresponds to at most one frame vj in V, and vice versa. The maximal number of mapping pairs is therefore equal to the number of frames in the shorter shot. Among all such mappings, the one with the minimum sum of frame distances is selected as the optimal mapping. The formal definitions are as follows.
Definition 1 Given two video shots U = (u1, u2, …, uM) and V = (v1, v2, …, vN) and the distance d(ui, vj) for all i, 1≦i≦M, and all j, 1≦j≦N, a mapping between them is a one-to-one relation RM from {1, 2, …, M} to {1, 2, …, N} such that
(1) |RM| = min{M, N}, where |RM| denotes the cardinality of RM,
(2) for any two ordered pairs (i, j), (k, l) in RM, (j < l) if and only if (i < k).
Definition 2 Given two video shots U = (u1, u2, …, uM) and V = (v1, v2, …, vN), the distance between U and V for a given mapping RM, D'RM(U, V), is defined as
$$D'_{R_M}(U, V) = \sum_{(i, j) \in R_M} d(u_i, v_j).$$
Definition 3 Given two video shots U = (u1, u2, …, uM) and V = (v1, v2, …, vN), the distance between U and V for Optimal Mapping (OM) is defined as
$$D_{OM}(U, V) = \min_{R_M} \{ D'_{R_M}(U, V) \}.$$
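Definitions 1 to 3 can be checked directly on small inputs by enumerating every admissible mapping. The following sketch does so under the assumption that the frame distances are supplied as an M x N table (0-based in the code). The function name om_brute_force is hypothetical, and the enumeration is exponential in the shot length, so it is meant only as a reference against which faster algorithms can be compared.

```python
from itertools import combinations
from typing import Sequence


def om_brute_force(d: Sequence[Sequence[float]]) -> float:
    """Minimise the summed distance over every order-preserving one-to-one mapping.

    d[i][j] holds d(u_{i+1}, v_{j+1}); the table has M rows and N columns.
    combinations() yields index tuples in increasing order, which enforces
    condition (2) of Definition 1.
    """
    M, N = len(d), len(d[0])
    if M <= N:
        # Every frame of U is mapped; choose which M frames of V receive them.
        return min(sum(d[i][j] for i, j in enumerate(cols))
                   for cols in combinations(range(N), M))
    # V is the shorter shot: choose which N frames of U are mapped.
    return min(sum(d[i][j] for j, i in enumerate(rows))
               for rows in combinations(range(M), N))


table = [[0, 3, 5],
         [5, 4, 0]]
print(om_brute_force(table))  # 0, via the pairs (u1, v1) and (u2, v3)
```

Because the mapping always has min{M, N} pairs, transposing the table (swapping U and V) gives the same value, which is the symmetry property stated above.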
The solution of DOM(U, V) can be found by dynamic programming. Assume that the shorter shot is U and the longer one is V. Our goal is to find the subsequence of V that is most similar to U. Let D[i, j] be the minimum cost of a mapping between (u1, u2, …, ui) and (v1, v2, …, vj). For the entry D[m, n] there are two possibilities:
map: the frame vn is mapped with the frame um, so D[m, n] = D[m-1, n-1] + d(um, vn).
ignore: the frame vn is not selected to be mapped with the frame um, so D[m, n] = D[m, n-1].
Combining these two cases, we get the following recurrence relation for the solution of DOM(U, V):
$$D[m, n] = \min \{ D[m-1, n-1] + d(u_m, v_n),\ D[m, n-1] \},$$
with D[0, j] = 0 for all j, 0≦j≦N, and D[i, 0] = ∞ for all i, 1≦i≦M. Note that the relation RM can be constructed by backtracking through the matrix D, starting from D[M, N].
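A minimal sketch of this dynamic program, including the backtracking step, might look as follows. It assumes the distances are given as an M x N table with the shorter shot as rows. The name optimal_mapping and the tie-breaking rule in the backtracking (prefer "ignore" when both cases give the same cost) are choices made for the example, not part of the report.

```python
import math
from typing import List, Sequence, Tuple


def optimal_mapping(d: Sequence[Sequence[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (D_OM, mapping) for an M x N distance table with M <= N.

    d[i-1][j-1] holds d(u_i, v_j); the returned pairs (i, j) are 1-based.
    """
    M, N = len(d), len(d[0])
    if M > N:
        raise ValueError("pass the shorter shot as the rows of d")
    # Boundary: D[0][j] = 0 for every j, D[i][0] = infinity for i >= 1.
    D = [[0.0] * (N + 1)] + [[math.inf] * (N + 1) for _ in range(M)]
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            mapped = D[m - 1][n - 1] + d[m - 1][n - 1]  # "map" case
            ignored = D[m][n - 1]                       # "ignore" case
            D[m][n] = min(mapped, ignored)
    # Backtrack from D[M][N] to recover the relation R_M.
    pairs: List[Tuple[int, int]] = []
    m, n = M, N
    while m > 0:
        if D[m][n] == D[m][n - 1]:
            n -= 1                      # v_n was ignored
        else:
            pairs.append((m, n))        # u_m was mapped with v_n
            m, n = m - 1, n - 1
    pairs.reverse()
    return D[M][N], pairs


cost, pairs = optimal_mapping([[0, 3, 5],
                               [5, 4, 0]])
print(cost, pairs)  # 0.0 [(1, 1), (2, 3)]
```

The table has (M+1)(N+1) entries and each is filled in constant time, so the sketch runs in O(MN) time and space.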
Optimal mapping is a one-to-one frame mapping. However, key frames are sometimes extracted by uniform sampling, in which case two consecutive key frames in the extracted sequence are likely to be similar. In addition, two sequences of key frames are sometimes extracted by non-uniform sampling with different thresholds, so that, given two similar shots, more key frames are extracted for the shot with the lower threshold. It is therefore necessary to measure sequence similarity based on many-to-many frame mapping.
Definition 4 Given two video shots U = (u1, u2, …, uM) and V = (v1, v2, …, vN), a mapping with replication is a many-to-many relation RMR from {1, 2, …, M} to {1, 2, …, N} such that
(1) for each i, 1≤i≤M, there exists at least one j, 1≤j≤N, such that (i, j) ∈ RMR,
(2) for each j, 1≤j≤N, there exists at least one i, 1≤i≤M, such that (i, j) ∈ RMR,
(3) for any two ordered pairs (i, j), (k, l) in RMR, (j≤l) if and only if (i≤k).
Definition 5 Given two video shots U = (u1, u2, …, uM) and V = (v1, v2, …, vN), the distance between U and V for a given mapping RMR, D'RMR(U, V), is defined as
$$D'_{R_{MR}}(U, V) = \sum_{(i, j) \in R_{MR}} d(u_i, v_j).$$
Definition 6 Given two video shots U = (u1, u2, …, uM) and V = (v1, v2, …, vN), and d(ui, vj) for all i, 1≤i≤M, and all j, 1≤j≤N, the distance between U and V for Optimal Mapping with Replication (OMR) is defined as
$$D_{OMR}(U, V) = \min_{R_{MR}} \{ D'_{R_{MR}}(U, V) \}.$$
Similar to DOM(U, V), the solution of DOMR(U, V) can be found by dynamic programming. There are three possible relations between D[m, n] and the entries D[i, j] with smaller i and j:
map: the frame vn is mapped with the frame um, so D[m, n] = D[m-1, n-1] + d(um, vn).
replicate vn: the frame vn is replicated to be mapped with the frame um, so D[m, n] = D[m-1, n] + d(um, vn).
replicate um: the frame um is replicated to be mapped with the frame vn, so D[m, n] = D[m, n-1] + d(um, vn).
Combining these three cases, we get the following recurrence relation for the solution of DOMR(U, V):
$$D[m, n] = \min \{ D[m-1, n-1] + d(u_m, v_n),\ D[m-1, n] + d(u_m, v_n),\ D[m, n-1] + d(u_m, v_n) \},$$
with D[0, 0] = 0, D[0, j] = ∞ for all j, 1≤j≤N, and D[i, 0] = ∞ for all i, 1≤i≤M.
A similarity measure is asymmetric if D(U, V) ≠ D(V, U). In general, an asymmetric similarity measure is used to map a query sequence of frames onto a video sequence of frames. The simplest proposed asymmetric similarity measure is the Optimal Subsequence Mapping (OSM). The algorithm for OSM is similar to that for OM, except that the query sequence must be the shorter sequence.
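To make the OMR recurrence and the symmetry property concrete, the following sketch fills the table for DOMR and evaluates it on a distance table and on its transpose (that is, with the roles of U and V swapped). The function names omr_distance and transpose are hypothetical, and the frame distance is assumed to satisfy d(u, v) = d(v, u).

```python
import math
from typing import List, Sequence


def omr_distance(d: Sequence[Sequence[float]]) -> float:
    """D_OMR for an M x N distance table, where d[i-1][j-1] = d(u_i, v_j)."""
    M, N = len(d), len(d[0])
    # Boundary: D[0][0] = 0, every other entry in row 0 and column 0 is infinity.
    D = [[math.inf] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            best_prev = min(D[m - 1][n - 1],  # map
                            D[m - 1][n],      # replicate v_n
                            D[m][n - 1])      # replicate u_m
            D[m][n] = best_prev + d[m - 1][n - 1]
    return D[M][N]


def transpose(d: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap the roles of the two shots in the distance table."""
    return [list(col) for col in zip(*d)]


table = [[0, 3, 5],
         [5, 4, 0]]
print(omr_distance(table))             # 3.0, via (u1, v1), (u1, v2), (u2, v3)
print(omr_distance(transpose(table)))  # 3.0, the same value with U and V swapped
```

The three cases share the term d(um, vn), so the sketch takes the minimum over the three predecessors first and adds the frame distance once.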
Analogous to the symmetric measure OMR, we define the asymmetric measure OSMR as follows.
Definition 7 Given the query shot Q = (q1, q2, …, qM) and the video shot V = (v1, v2, …, vN), M≦N, a subsequence mapping with replication is a one-to-many relation RSMR from {1, 2, …, M} to {1, 2, …, N} such that
(1) for each i, 1≦i≦M, there exists at least one j, 1≦j≦N, such that (i, j) ∈ RSMR,
(2) for each j, 1≦j≦N, there exists one i, 1≦i≦M, such that (i, j) ∈ RSMR,
(3) for any two ordered pairs (i, j), (k, l) in RSMR, (j < l) if and only if (i≤k).
Definition 8 Given the query shot Q = (q1, q2, …, qM), the video shot V = (v1, v2, …, vN), and the distance d(qi, vj) for all i, 1≦i≦M, and all j, 1≦j≦N, the distance between Q and V for a given mapping RSMR, D'RSMR(Q, V), is defined as
$$D'_{R_{SMR}}(Q, V) = \sum_{(i, j) \in R_{SMR}} d(q_i, v_j).$$
Definition 9 Given the query shot Q = (q1, q2, …, qM) and the video shot V = (v1, v2, …, vN), the distance between Q and V for Optimal Subsequence Mapping with Replication (OSMR) is defined as
$$D_{OSMR}(Q, V) = \min_{R_{SMR}} \{ D'_{R_{SMR}}(Q, V) \}.$$
Sometimes the mapped frames are required to be consecutive, which places a stricter constraint on the mapping relation.
Definition 10 Given the query shot Q = (q1, q2, …, qM) and the video shot V = (v1, v2, …, vN), M≦N, a consecutive mapping is a one-to-one relation RCM from {1, 2, …, M} to {1, 2, …, N} such that
(1) for each i, 1≦i≦M, there exists one j, 1≦j≦N, such that (i, j) ∈ RCM,
(2) for any two ordered pairs (i, j), (k, l) in RCM, [(j - l) = 1] if and only if [(i - k) = 1].
Definition 11 Given the query shot Q = (q1, q2, …, qM), the video shot V = (v1, v2, …, vN), and d(qi, vj) for all i, 1≦i≦M, and all j, 1≦j≦N, the distance between Q and V for a given consecutive mapping RCM, D'RCM(Q, V), is defined as
$$D'_{R_{CM}}(Q, V) = \sum_{(i, j) \in R_{CM}} d(q_i, v_j).$$
Definition 12 Given the query shot Q = (q1, q2, …, qM) and the video shot V = (v1, v2, …, vN), the distance between Q and V for Optimal Consecutive Mapping is defined as
$$D_{OCM}(Q, V) = \min_{R_{CM}} \{ D'_{R_{CM}}(Q, V) \}.$$
Algorithm Optimal Consecutive Mapping (OCM)
for j = 0 to N-M do D[0, j] = 0;
for j = 0 to N-M do
  for i = 1 to M-1 do
    D[i, i+j] = D[i-1, i+j-1] + d(qi, vi+j);
D[M, M] = D[M-1, M-1] + d(qM, vM);
for j = M+1 to N do
  D[M, j] = min(D[M-1, j-1] + d(qM, vj), D[M, j-1]);
return D[M, N]
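Since a consecutive mapping is fully determined by the offset at which the query is aligned with the video shot, DOCM can also be computed by sliding the query along V and summing the frame distances of each window. The sketch below takes this direct route rather than filling the table D, and assumes an M x N distance table with M≦N. The name ocm_distance is hypothetical.

```python
from typing import Sequence


def ocm_distance(d: Sequence[Sequence[float]]) -> float:
    """D_OCM: the cheapest alignment of Q with M consecutive frames of V.

    d[i-1][j-1] holds d(q_i, v_j) for an M x N table with M <= N.
    """
    M, N = len(d), len(d[0])
    if M > N:
        raise ValueError("the query must not be longer than the video shot")
    # Offset s aligns q_i with v_{i+s}, for s = 0 .. N-M; each offset is one
    # diagonal of the distance table.
    return min(sum(d[i][i + s] for i in range(M)) for s in range(N - M + 1))


table = [[0, 3, 5],
         [5, 4, 0]]
print(ocm_distance(table))  # 3: offset 0 costs 0 + 4, offset 1 costs 3 + 0
```

Both this sketch and the table-based algorithm take O(M(N-M+1)) time. For the same table, a non-consecutive one-to-one mapping can reach a lower cost (u1 with v1 and u2 with v3), which shows the effect of the consecutiveness constraint.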
Next, we extend the definition of Optimal Consecutive Mapping to Optimal Consecutive Mapping with Replication (OCMR). In OCMR, each frame of the query sequence is permitted to map with more than one frame of the video sequence. As in OCM, the mapped frames of the video sequence must be consecutive.
Definition 13 Given the query video shot Q = (q1, q2, …, qM) and the video shot V = (v1, v2, …, vN), M≦N, a consecutive mapping with replication is a one-to-many relation RCMR from {1, 2, …, M} to {1, 2, …, N} such that
(1) for each i, 1≦i≦M, there exists at least one j, 1≦j≦N, such that (i, j) ∈ RCMR,
(2) let pmax(i) be max{ j | (i, j) ∈ RCMR } and pmin(i) be min{ j | (i, j) ∈ RCMR }; for each j, pmin(1)≦j≦pmax(M), there exists one i, 1≦i≦M, such that (i, j) ∈ RCMR,
(3) for any two ordered pairs (i, j), (k, l) in RCMR, if (i < k) then (j < l),
(4) for any two ordered pairs (i, j), (k, l) in RCMR, if [(j - l) = 1] then either [(i - k) = 1] or [(i - k) = 0].
Definition 14 Given the query video shot Q = (q1, q2, …, qM), the video shot V = (v1, v2, …, vN), and d(qi, vj) for all i, 1≦i≦M, and all j, 1≦j≦N, the distance between Q and V for a given consecutive mapping with replication RCMR, D'RCMR(Q, V), is defined as
$$D'_{R_{CMR}}(Q, V) = \sum_{(i, j) \in R_{CMR}} d(q_i, v_j).$$
Definition 15 Given the query shot Q = (q1, q2, …, qM) and the video shot V = (v1, v2, …, vN), the distance between Q and V for Optimal Consecutive Mapping with Replication is defined as
$$D_{OCMR}(Q, V) = \min_{R_{CMR}} \{ D'_{R_{CMR}}(Q, V) \}.$$
The solution of OCMR can be obtained as follows. First, if a mapping R is an OCMR, then the first query frame q1 must be mapped with only one frame of the video sequence, and so must the last query frame qM. Otherwise, suppose that q1 is mapped with frames va, …, vi-1, vi and that qM is mapped with vj, vj+1, …, vb. Because frame distances are nonnegative, we can derive a mapping whose distance is no larger by removing the mapping pairs between q1 and va, …, vi-1 and those between qM and vj+1, …, vb. Therefore, q1 and qM behave like frames in OCM, while q2, q3, …, qM-1 behave like frames in OSMR.
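The report does not state an explicit recurrence for OCMR, so the following is only one possible dynamic program consistent with the observations above, offered as a sketch. It assumes that F[i][j] denotes the minimum cost of mapping q1, …, qi onto consecutive video frames ending at vj with qi mapped to vj, that q1 and qM each take exactly one frame, and that only the interior query frames may be replicated. The table F and the name ocmr_distance are assumptions of this example.

```python
import math
from typing import Sequence


def ocmr_distance(d: Sequence[Sequence[float]]) -> float:
    """A sketch of D_OCMR for an M x N distance table with M <= N.

    d[i-1][j-1] holds d(q_i, v_j).
    """
    M, N = len(d), len(d[0])
    if M > N:
        raise ValueError("the query must not be longer than the video shot")
    F = [[math.inf] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):
        F[1][j] = d[0][j - 1]            # q_1 takes exactly one frame, anywhere in V
    for i in range(2, M + 1):
        for j in range(i, N + 1):
            advance = F[i - 1][j - 1]    # q_i starts on the frame right after q_{i-1}
            # Only interior query frames (1 < i < M) may be replicated onto v_j.
            replicate = F[i][j - 1] if i < M else math.inf
            F[i][j] = min(advance, replicate) + d[i - 1][j - 1]
    # The mapping may end at any video frame.
    return min(F[M][1:])


table = [[0, 9, 9, 9],
         [9, 1, 1, 9],
         [9, 9, 9, 0]]
print(ocmr_distance(table))  # 2: q1 with v1, q2 with v2 and v3, q3 with v4
```

In this toy table the best mapping without replication costs 10 for either offset, so allowing q2 to cover two consecutive frames is what brings the distance down to 2.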
4. Experiments
To evaluate the performance of the proposed similarity measures, we built a database of 100 video shots. The shots are MPEG-II video digitized from CTS news, and their lengths range from 32 to 205 seconds. Five sample query videos were selected from the database. Each video clip was first decompressed, and a sequence of key frames was then extracted by non-uniform key frame extraction. Each key frame is represented as a 64-bin color histogram in HSV color space. We measure performance by precision and recall, with the ground truth for relevant videos judged by humans. For each of the similarity measures OM and OMR, each sample query returns a ranked list of candidate videos, from which the precision-recall values are calculated.
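For reference, precision and recall over a ranked result list can be computed as in the sketch below. The shot identifiers and the relevance judgments are made-up placeholders, not data from the experiments, and the function name precision_recall is hypothetical.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def precision_recall(ranked: Sequence[Hashable],
                     relevant: Set[Hashable]) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each of the top-k results, k = 1 .. len(ranked)."""
    if not relevant:
        raise ValueError("the ground-truth set must not be empty")
    points: List[Tuple[float, float]] = []
    hits = 0
    for k, shot_id in enumerate(ranked, start=1):
        if shot_id in relevant:
            hits += 1
        # precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant
        points.append((hits / k, hits / len(relevant)))
    return points


# Hypothetical ranked answer for one query and its human-judged ground truth.
ranked = ["s7", "s2", "s9", "s4"]
relevant = {"s7", "s4"}
curve = precision_recall(ranked, relevant)
print(curve[0])   # (1.0, 0.5): the top-ranked shot is relevant
print(curve[-1])  # (0.5, 1.0): both relevant shots found within four results
```

Averaging such curves over the sample queries, at fixed recall levels, yields an average precision-recall curve of the kind reported for the experiments.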
Figure 1. The average precision-recall curve for similarity measure OM. (Plot not reproduced; axes: recall and precision, each from 0 to 1.)
Figure 2. The average precision-recall curve for similarity measure OMR. (Plot not reproduced; axes: recall and precision, each from 0 to 1.)
Figures 1 and 2 show the average precision-recall curves for similarity measures OM and OMR, respectively. The results show that both measures perform well, especially when only one video is returned for a query video. In this case the precision is one; that is, both measures return the most similar video in the first rank. Moreover, OM performs better than OMR. This can be explained as follows. OM is a one-to-one mapping that leaves out unmatched frames, while OMR is a many-to-many mapping in which all frames are considered. From the point of view of human perception, however, two video shots are judged similar when they share similar frames. OM therefore behaves more like human perception.
5. Conclusions
In this project, we have proposed a series of video similarity measures based on the similarity of frame sequences. The corresponding similarity algorithms, based on dynamic programming, are also presented. The experimental results show that the similarity measure OM performs better than OMR.
In fact, performance depends strongly on the process used to extract video content. We plan to measure performance while taking the effect of the content extraction process into account. In addition, the proposed similarity measures permit two dissimilar frames to be mapped. Similarity measures that impose a dissimilarity constraint remain to be investigated.
6. Contribution
[1] M. K. Shan and S. Y. Lee, Content-Based Similarity Measures for Video Based on Similarity of Frame Sequence, IEEE IW-MMDBMS’98 International Workshop on Multimedia Data Base Management Systems, Dayton, Ohio, 1998.
[2] M. K. Shan and S. Y. Lee, A Generic Framework for Similarity Measures of Content-Based Video Retrieval, submitted to Pattern Recognition Letters, 1999.