Video Sequence Matching - Robust Video Sequence Retrieval Using A Novel Object-Based

Chapter 5. Robust Video Sequence Retrieval Using A Novel Object-Based

5.4 Video Sequence Matching

After video segments are characterized by the object-based 2D-histogram descriptor, the temporal relationships among the moving objects have to be described.

To characterize these temporal relationships, a few DCT coefficients of the transformed time sequence are used to represent the variations of the original objects across consecutive frames. Section 5.4.1 briefly reviews the DCT. Section 5.4.2 describes how a video sequence is represented. Section 5.4.3 discusses the similarity metric used to measure the degree of similarity between two video sequences.

5.4.1 Discrete Cosine Transform

The DCT (Discrete Cosine Transform) is a powerful tool that has been extensively used in many data compression applications. The DCT of a finite length sequence often has its coefficients more highly concentrated at low indices than other transforms do [67]. It has been proven in [68] that the approximation capability of DCT is much better than that of other approximation methods. Therefore, we shall use the DCT to characterize the temporal variations among moving objects in a video sequence.
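The energy-compaction property mentioned above can be checked numerically. The following is a minimal sketch, assuming the orthonormal DCT-II and a hypothetical slowly varying signal; the function names `dct2` and `energy_fraction` are illustrative and do not come from the thesis.

```python
import math

def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(N^2))."""
    n = len(signal)
    coeffs = []
    for f in range(n):
        # Orthonormal scaling: sqrt(1/N) for the DC term, sqrt(2/N) otherwise.
        scale = math.sqrt(1.0 / n) if f == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (2 * t + 1) * f / (2 * n))
                    for t, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs

def energy_fraction(coeffs, k):
    """Fraction of the total energy held by the first k coefficients."""
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:k]) / total if total else 1.0

# Hypothetical, slowly varying bin values over 16 frames (not thesis data).
signal = [5 + 3 * math.sin(2 * math.pi * t / 32) for t in range(16)]
coeffs = dct2(signal)
print(energy_fraction(coeffs, 3))  # share of energy in the DC term plus two AC terms
```

For a smooth signal such as this one, nearly all of the energy falls in the first few coefficients, which is the reason a short prefix of the DCT can stand in for the whole time sequence.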

5.4.2 Representation of Video Sequences

In this section, we describe how the DCT is used to characterize the temporal variations among moving objects. The algorithm for generating the video sequence representation is as follows:

Video Sequence Representation Algorithm

1. Detect moving objects by clustering macroblocks that have similar motion vector magnitudes and similar motion directions.

2. For each object Obj^{i,r}, where Obj^{i,r} denotes the rth object in the ith P-frame:

Compute the centroid and the object size in units of macroblocks.

3. Set the number of histogram bins to β.

4. For each P-frame P_i:

Compute the X-histogram and the Y-histogram according to the horizontal and vertical positions of the objects, respectively.

5. For each sequence of histogram bins [Bin^Z_{t,j}], where t ∈ [1, N], j ∈ [1, β], and Z ∈ {X, Y}:

Compute the transformed sequence [Z_{f,j}] using the Discrete Cosine Transform.

6. Set the number of DCT coefficients to α.

7. For the β transformed sequences [Z_{f,j}] of DCT coefficients:

Select the DC coefficient and (α-1) AC coefficients to represent a transformed sequence.
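Step 5 can be written out explicitly. As a sketch, assuming the orthonormal DCT-II (the normalization and the index convention are assumptions, and the thesis may define the transform differently), the transform of the jth bin sequence takes the form:

```latex
Z_{f,j} \;=\; c_f \sum_{t=1}^{N} \mathrm{Bin}^{Z}_{t,j}\,
        \cos\!\left(\frac{(2t-1)(f-1)\,\pi}{2N}\right),
\qquad f = 1,\dots,N,\quad j = 1,\dots,\beta,\quad Z \in \{X, Y\},

c_1 = \sqrt{\tfrac{1}{N}}, \qquad c_f = \sqrt{\tfrac{2}{N}} \quad (f \ge 2).
```

Under this convention f = 1 is the DC coefficient, so steps 6 and 7 keep Z_{1,j}, ..., Z_{α,j} for each bin j. The representation then has 2αβ values instead of 2Nβ. For example, with hypothetical values N = 30, β = 8, and α = 4, the size falls from 480 to 64.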

Fig. 5-4 is a graphical representation of the above algorithm. For each P-frame, the object-based motion activity is described by a 2D-histogram, in which the spatial distributions of moving objects in the horizontal and vertical directions are characterized by the bin values of the X-histogram and the Y-histogram, respectively. A video sequence can therefore be represented by a sequence of 2D-histograms with 2Nβ dimensions, where N is the number of P-frames in the video sequence and β is the number of bins in each of the X-histogram and the Y-histogram. To reduce the dimensionality of the feature space, the DCT is used to transform the 2D-histogram sequence of the original video into the frequency domain. The value of a bin of the X-histogram (Y-histogram) in the ith P-frame is treated as a signal sample at time i, so the values of that bin over the N consecutive P-frames form a time signal of length N. After the transform, each histogram of the video sequence is represented by β sequences of DCT coefficients, one per histogram bin. In other words, the temporal variations of the original objects across successive P-frames are characterized by β sequences of DCT coefficients in the frequency domain.

It is well known that the DC term together with the first few low-frequency AC terms suffices to approximate such a signal. Therefore, to simplify computation, we use only these terms to represent a video sequence instead of all coefficients. However, choosing an appropriate number of AC coefficients is a crucial issue. Since the selection of coefficients is an ill-posed problem, we shall discuss it in the experiments.
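The representation procedure of this section can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation. Object detection is assumed to have already produced, for each P-frame, a list of hypothetical (centroid_x, centroid_y, size) tuples in macroblock units. Weighting each bin by object size, the orthonormal DCT-II, the frame dimensions, and all function names are assumptions.

```python
import math

def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (assumed transform convention)."""
    n = len(signal)
    out = []
    for f in range(n):
        scale = math.sqrt(1.0 / n) if f == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x * math.cos(math.pi * (2 * t + 1) * f / (2 * n))
                               for t, x in enumerate(signal)))
    return out

def histograms(objects, width_mb, height_mb, beta):
    """X- and Y-histograms of one P-frame; each object adds its size to the
    bin that contains its centroid (size weighting is an assumption)."""
    x_hist = [0.0] * beta
    y_hist = [0.0] * beta
    for cx, cy, size in objects:
        x_hist[min(int(cx * beta / width_mb), beta - 1)] += size
        y_hist[min(int(cy * beta / height_mb), beta - 1)] += size
    return x_hist, y_hist

def represent_sequence(frames, width_mb, height_mb, beta, alpha):
    """Return {'X': ..., 'Y': ...}; each value holds beta lists with the
    first alpha DCT coefficients (DC plus alpha-1 AC) of one bin's time signal."""
    per_frame = [histograms(objs, width_mb, height_mb, beta) for objs in frames]
    rep = {}
    for axis, name in enumerate(("X", "Y")):
        rep[name] = []
        for j in range(beta):
            time_signal = [hist[axis][j] for hist in per_frame]
            rep[name].append(dct2(time_signal)[:alpha])
    return rep

# Hypothetical shot: one object of size 4 moving left to right over 8 P-frames
# in a 22 x 18 macroblock frame.
frames = [[(2 + 2 * t, 9, 4)] for t in range(8)]
rep = represent_sequence(frames, 22, 18, beta=4, alpha=3)
```

Here `rep["X"]` and `rep["Y"]` each contain β = 4 sequences of α = 3 coefficients, so the shot is described by 2αβ = 24 numbers rather than 2Nβ = 64.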

Fig. 5-4. Video sequences are characterized by the object-based T2D-Histogram descriptor and further represented by reduced low-dimensional DCT coefficients

5.4.3 Choice of Similarity Measure

An important consequence of Parseval's theorem is that the Euclidean distance between DCT-transformed signals is able to maintain the local topology. Therefore, we employ a modified Euclidean distance as the metric for matching video sequences. Let [W^X_f] and [H^X_f] be two finite point sets of the X-histogram ([W^Y_f] and [H^Y_f] for the Y-histogram), where W and H are the transformed signals of the two video sequences w and h, respectively. The modified Euclidean distance between w and h is defined in Eq. (5-4), in which j denotes the jth histogram bin, f denotes the fth coefficient, and α denotes the number of selected DCT coefficients. shr(n,H) is a bin-rotating function that cyclically rotates the β histogram bins to the right n times. For example, shr(1,H) shifts the first (β-1) bins one position to the right and moves the last bin from the βth position to the 1st. With shr(n,H) in the distance metric, two video sequences are regarded as similar when they are spatially and temporally similar. If shr(n,H) were not employed in the distance function, a shot A with objects positioned on the left and a shot B with objects positioned on the right would be regarded as dissimilar, because the peak bins of shots A and B lie on the left and right, respectively, and the distance between A and B would therefore be very large.
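The role of the bin-rotating function can be illustrated with a short sketch. The exact form of the distance is assumed here: this example takes the minimum, over all β cyclic rotations, of the Euclidean distance accumulated over bins j and coefficients f. The name `modified_distance` and the toy coefficient values are hypothetical.

```python
import math

def shr(n, bins):
    """Cyclically rotate a list of histogram bins n positions to the right."""
    n %= len(bins)
    return bins[-n:] + bins[:-n] if n else list(bins)

def modified_distance(w, h):
    """Smallest Euclidean distance between w and any cyclic rotation of h
    (taking the minimum over n is an assumption of this sketch).

    w, h: lists of beta bins, each a list of alpha DCT coefficients.
    """
    best = math.inf
    for n in range(len(h)):
        rotated = shr(n, h)
        squared = sum((wc - hc) ** 2
                      for w_bin, h_bin in zip(w, rotated)
                      for wc, hc in zip(w_bin, h_bin))
        best = min(best, math.sqrt(squared))
    return best

# Toy data: shot A has its activity in the leftmost bin, shot B in the rightmost.
shot_a = [[5.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
shot_b = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [5.0, 1.0]]
print(modified_distance(shot_a, shot_b))
```

Without the rotation the two toy shots would be separated by a large distance even though their temporal behavior is identical. With it, one rotation aligns the peak bins and the distance vanishes.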

To further account for the overall moving trend of objects within a video sequence, the distances computed from the X-histogram and the Y-histogram are weighted adaptively according to the average motion vector magnitudes in the x- and y-directions. The total distance between two video sequences w and h is then defined as the weighted combination of these two distances, where N is the number of P-frames and the weights are computed from the average motion vector magnitudes of the X-component and the Y-component (the latter written MV_{i,V}) of the inter-coded macroblocks in the ith P-frame. The analysis of object motion is split into two independent directions for the following reason. It is well known that a camera normally pans or tilts to follow moving objects in a scene. As a result, the global motion is mainly horizontal (vertical) when most active regions move in the horizontal (vertical) direction. It is therefore feasible to use the dominant moving trend to measure video similarity. For example, the above similarity metric can discriminate between baseball and football videos, because most players in a baseball game run vertically and the camera tilts to track them or the ball, while players in a football game primarily run horizontally and the camera pans to track significant events.
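The adaptive weighting can be sketched as follows. The weighting rule is an assumption: each direction receives the share of the summed per-frame average motion vector magnitude that falls in that direction, and the two directional distances are combined linearly. The function names and the numbers are hypothetical, and the thesis may normalize differently.

```python
def direction_weights(mv_h, mv_v):
    """Weights for the X- and Y-histogram distances.

    mv_h, mv_v: per-P-frame average motion vector magnitudes of the
    horizontal and vertical components (one value per P-frame).
    """
    total_h = sum(mv_h)
    total_v = sum(mv_v)
    total = total_h + total_v
    if total == 0:
        return 0.5, 0.5  # no motion at all: fall back to equal weights
    return total_h / total, total_v / total

def total_distance(d_x, d_y, mv_h, mv_v):
    """Weighted combination of the two directional distances (assumed form)."""
    w_x, w_y = direction_weights(mv_h, mv_v)
    return w_x * d_x + w_y * d_y

# Hypothetical shot dominated by horizontal motion (camera pan).
print(total_distance(2.0, 4.0, mv_h=[3.0, 3.0], mv_v=[1.0, 1.0]))
```

With mostly horizontal motion the X-histogram distance dominates the total, which matches the intent of letting the dominant moving trend drive the comparison.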

From the document: Research on Advanced Video Processing, Capture, Feature Extraction, and Video Structuring Computation (pp. 114-119)