
* Corresponding author. Tel.: +39 040-676-7147; fax: +39 040-676-3460.

E-mail addresses: (I. Koprinska), car- (S. Carrato).

Temporal video segmentation: A survey

Irena Koprinska , Sergio Carrato *

Institute for Information Technologies, Acad. G. Bonchev Str., Bl. 29A, 1113 Sofia, Bulgaria

Department of Electrical Engineering and Computer Science (D.E.E.I), Image Processing Laboratory, University of Trieste, via Valerio 10, 34127 Trieste, Italy

Received 27 July 1999; received in revised form 8 February 2000; accepted 15 February 2000


Temporal video segmentation is the first step towards automatic annotation of digital video for browsing and retrieval. This article gives an overview of existing techniques for video segmentation that operate on both uncompressed and compressed video streams. The performance, relative merits and limitations of each approach are comprehensively discussed and contrasted. The gradual development of the techniques, and how the uncompressed domain methods were tailored and applied to the compressed domain, are considered. In addition to the algorithms for shot boundaries detection, the related topic of camera operation recognition is also reviewed. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Temporal video segmentation; Shot boundaries detection; Camera operations; Video databases

1. Introduction

Recent advances in multimedia compression technology, coupled with the significant increase in computer performance and the growth of the Internet, have led to the widespread use and availability of digital video. Applications such as digital libraries, distance learning, video-on-demand, digital video broadcast, interactive TV and multimedia information systems generate and use large collections of video data. This has created a need for tools that can efficiently index, search, browse and retrieve relevant material. Consequently, several content-based retrieval systems for organizing and managing video databases have been recently proposed [8,26,34].

As shown in Fig. 1, temporal video segmentation is the first step towards automatic annotation of digital video sequences. Its goal is to divide the video stream into a set of meaningful and manageable segments (shots) that are used as basic elements for indexing. Each shot is then represented by selecting key frames and indexed by extracting spatial and temporal features. The retrieval is based on the similarity between the feature vector of the query and already stored video features.

A shot is defined as an unbroken sequence of frames taken from one camera. There are two basic types of shot transitions: abrupt and gradual.

Abrupt transitions (cuts) are simpler: they occur in a single frame when the camera is stopped and restarted. Although many kinds of cinematic effects could be applied to artificially combine two shots, and thus to create gradual transitions, fades and dissolves are used most often. A fade out is a slow decrease in brightness resulting in a black frame; a fade in is a gradual increase in intensity starting from a black image. Dissolves show one image superimposed on the other as the frames of the first shot get dimmer and those of the second one get brighter.

Fig. 2 shows an example of a dissolve and a cut. A fade out followed by a fade in is presented in Fig. 3.

Fig. 1. Content-based retrieval of video databases.

Fig. 2. Dissolve, cut.

0923-5965/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.

PII: S0923-5965(00)00011-4

Gradual transitions are more difficult to detect than cuts. They must be distinguished from camera operations (Fig. 4) and object movement that exhibit temporal variances of the same order and cause false positives. It is particularly difficult to detect dissolves between sequences involving intensive motion [14,44,47].

Camera operation recognition is an important issue also for another reason. As camera operations usually explicitly reflect how the attention of the viewer should be directed, the clues obtained are useful for key frame selection. For example, when a camera pans over a scene, the entire video sequence belongs to one shot but the content of the scene could change substantially, thus suggesting the use of more than one key frame. Also, when the camera zooms, the images at the beginning and end of the zoom may be considered as representative of the entire shot. Furthermore, recognizing camera operations allows the construction of salient video stills [38], static images that efficiently represent video content.

Algorithms for shot boundaries detection have already been discussed in several review papers. Ahanger and Little [4] presented a survey of video indexing, including some techniques for temporal video segmentation, mainly in the uncompressed domain. Idris and Panchanathan [15] surveyed methods for content-based indexing in image and video databases, focusing on feature extraction. A review of video parsing is presented there, but it mainly includes methods that operate in the uncompressed domain and detect cuts. The goal of this paper is to provide a comprehensive taxonomy and critical survey of the existing approaches for temporal video segmentation in both uncompressed and compressed video. The performance, relative merits and shortcomings of each approach are discussed in detail. Special attention is given to the gradual development and improvement of the techniques, their relationships and similarities, in particular how the uncompressed domain methods were tailored and imported into the compressed domain. In addition to the algorithms for shot boundaries detection, the related topic of camera operation recognition is also discussed.

Fig. 3. Fade out followed by fade in.

Fig. 4. Basic camera operations: fixed, zooming (focal length change of a stationary camera), panning/tilting (camera rotation around its horizontal/vertical axis), tracking/booming (horizontal/vertical transverse movement) and dollying (horizontal lateral movement).

The paper is organized as follows. In the next section we review shot boundaries detection techniques, starting with approaches in the uncompressed domain and then moving to the compressed domain via an introduction to MPEG fundamentals. An overview of methods for camera operation recognition is presented in Section 3. Finally, a summary with future directions concludes the paper.

2. Temporal video segmentation

More than eight years of temporal video segmentation research have resulted in a great variety of algorithms. Early work focused on cut detection, while more recent techniques deal with the harder problem of gradual transitions detection.

2.1. Temporal video segmentation in uncompressed domain

The majority of algorithms process uncompressed video. Usually, a similarity measure between successive images is defined. When two images are sufficiently dissimilar, there may be a cut. Gradual transitions are found by using cumulative difference measures and more sophisticated thresholding schemes.

Based on the metrics used to detect the difference between successive frames, the algorithms can be divided broadly into three categories: pixel, block-based and histogram comparisons.

2.1.1. Pixel comparison

Pair-wise pixel comparison (also called template matching) evaluates the differences in intensity or color values of corresponding pixels in two successive frames.

The simplest way is to calculate the sum of absolute pixel differences and compare it against a threshold [18]:

$$D(i, i+1) = \sum_{x=1}^{X} \sum_{y=1}^{Y} \left| P_i(x, y) - P_{i+1}(x, y) \right|$$

for gray level images,

$$D(i, i+1) = \sum_{x=1}^{X} \sum_{y=1}^{Y} \sum_{c} \left| P_i(x, y, c) - P_{i+1}(x, y, c) \right|$$

for color images, (1)

where $i$ and $i+1$ are two successive frames with dimension $X \times Y$, $P_i(x, y)$ is the intensity value of the pixel at the coordinates $(x, y)$ in frame $i$, $c$ is the index for the color components (e.g. $c \in \{R, G, B\}$ in the case of the RGB color system) and $P_i(x, y, c)$ is the color component of the pixel at $(x, y)$ in frame $i$.

A cut is detected if the difference $D(i, i+1)$ is above a prespecified threshold $T$. The main disadvantage of this method is that it is not able to distinguish between a large change in a small area and a small change in a large area. For example, cuts are falsely detected when a small part of the frame undergoes a large, rapid change. Therefore, methods based on simple pixel comparison are sensitive to object and camera movements.
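As a minimal sketch of the gray-level form of Eq. (1), the following Python snippet sums absolute pixel differences and compares the result against a threshold. Frames are assumed to be equal-size lists of rows of integer intensities; the function names and the toy frames are illustrative and not taken from [18].

```python
def pixel_difference(frame_a, frame_b):
    """Sum of absolute intensity differences between two gray-level frames."""
    return sum(
        abs(pa - pb)
        for row_a, row_b in zip(frame_a, frame_b)
        for pa, pb in zip(row_a, row_b)
    )


def is_cut(frame_a, frame_b, threshold):
    """Declare a cut when the difference exceeds a (hypothetical) threshold T."""
    return pixel_difference(frame_a, frame_b) > threshold


# Toy frames: one pixel changing by 40 scores the same as four pixels changing by 10.
base = [[10, 10], [10, 10]]
local_change = [[50, 10], [10, 10]]
global_change = [[20, 20], [20, 20]]
print(pixel_difference(base, local_change), pixel_difference(base, global_change))
```

The toy frames illustrate the weakness noted above: the measure cannot tell a large change in a small area from a small change in a large area.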

A possible improvement is to count the number of pixels that change in value by more than some threshold and to compare the total against a second threshold [25,45]:

$$DP(i, i+1, x, y) = \begin{cases} 1 & \text{if } \left| P_i(x, y) - P_{i+1}(x, y) \right| > T_1, \\ 0 & \text{otherwise,} \end{cases}$$

$$D(i, i+1) = \frac{\sum_{x=1}^{X} \sum_{y=1}^{Y} DP(i, i+1, x, y)}{XY}. \quad (2)$$

If the percentage of changed pixels $D(i, i+1)$ is greater than a threshold $T_2$, a cut is detected.
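The counting scheme of Eq. (2) can be sketched in a few lines of Python. This is only an illustration under the assumption that frames are equal-size lists of rows of gray-level intensities; the parameter name `pixel_threshold` stands for the first (per-pixel) threshold and is not a name used in [25,45].

```python
def changed_fraction(frame_a, frame_b, pixel_threshold):
    """Fraction of pixels whose intensity changes by more than pixel_threshold."""
    total = 0
    changed = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            if abs(pa - pb) > pixel_threshold:
                changed += 1
    return changed / total


base = [[10, 10], [10, 10]]
local_change = [[50, 10], [10, 10]]   # one pixel changes by 40
global_change = [[20, 20], [20, 20]]  # every pixel changes by 10
print(changed_fraction(base, local_change, 20))   # one of four pixels counted
print(changed_fraction(base, global_change, 20))  # small changes are ignored
```

A cut would then be declared when the returned fraction exceeds the second threshold.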

Although some irrelevant frame differences are filtered out, these approaches are still sensitive to object and camera movements. For example, if the camera pans, a large number of pixels can be judged as changed, even though the image has actually shifted by only a few pixels. It is possible to reduce this effect to a certain extent by applying a smoothing filter: before the comparison each pixel is replaced by the mean value of its neighbors.

2.1.2. Block-based comparison

In contrast to template matching, which is based on a global image characteristic (pixel by pixel differences), block-based approaches use local characteristics to increase the robustness to camera and object movement. Each frame $i$ is divided into $b$ blocks that are compared with their corresponding blocks in $i+1$. Typically, the difference between $i$ and $i+1$ is measured by

$$D(i, i+1) = \sum_{k=1}^{b} c_k \, DP(i, i+1, k), \quad (3)$$

where $c_k$ is a predetermined coefficient for the block $k$ and $DP(i, i+1, k)$ is a partial match value between the $k$th blocks in frames $i$ and $i+1$.

In [17] corresponding blocks are compared using a likelihood ratio

$$\lambda_k = \frac{\left[ \dfrac{\sigma_{k,i} + \sigma_{k,i+1}}{2} + \left( \dfrac{\mu_{k,i} - \mu_{k,i+1}}{2} \right)^2 \right]^2}{\sigma_{k,i} \cdot \sigma_{k,i+1}}, \quad (4)$$

where $\mu_{k,i}$, $\mu_{k,i+1}$ are the mean intensity values for the two corresponding blocks $k$ in the consecutive frames $i$ and $i+1$, and $\sigma_{k,i}$, $\sigma_{k,i+1}$ are their variances, respectively. Then, the number of blocks for which the likelihood ratio is greater than a threshold $T_1$ is counted,

$$DP(i, i+1, k) = \begin{cases} 1 & \text{if } \lambda_k > T_1, \\ 0 & \text{otherwise.} \end{cases} \quad (5)$$

A cut is declared when the number of changed blocks is large enough, i.e. $D(i, i+1)$ is greater than a given threshold $T_2$ and $c_k = 1$ for all $k$.

Fig. 5. Net comparison algorithm: base windows $B_{ij}$.

Compared to template matching, this method is more tolerant to slow and small object motion from frame to frame. On the other hand, it is slower due to the complexity of the statistical formulas. An additional potential disadvantage is that no change will be detected in the case of two corresponding blocks that are different but have the same density function. Such situations, however, are very unlikely.
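The block likelihood ratio can be illustrated with a short Python sketch. It assumes the usual form of the ratio (squared sum of the averaged variances and the squared half-difference of the means, divided by the product of the variances); the small `eps` term that guards against division by zero for flat blocks is an addition for this sketch and is not part of [17].

```python
def mean_and_variance(block):
    """Mean and (population) variance of the intensities in a block."""
    values = [p for row in block for p in row]
    mean = sum(values) / len(values)
    variance = sum((p - mean) ** 2 for p in values) / len(values)
    return mean, variance


def likelihood_ratio(block_a, block_b, eps=1e-9):
    """Likelihood ratio between two corresponding blocks (eps is an assumption)."""
    mean_a, var_a = mean_and_variance(block_a)
    mean_b, var_b = mean_and_variance(block_b)
    numerator = ((var_a + var_b) / 2 + ((mean_a - mean_b) / 2) ** 2) ** 2
    return numerator / (var_a * var_b + eps)


def count_changed_blocks(blocks_a, blocks_b, ratio_threshold):
    """Number of block pairs whose likelihood ratio exceeds the threshold."""
    return sum(
        1 for a, b in zip(blocks_a, blocks_b)
        if likelihood_ratio(a, b) > ratio_threshold
    )


same_stats = likelihood_ratio([[0, 2]], [[2, 0]])   # different pixels, same mean/variance
shifted = likelihood_ratio([[0, 2]], [[4, 6]])      # same variance, mean shifted by 4
print(round(same_stats, 3), round(shifted, 3))
```

The first pair shows the blind spot mentioned above: two blocks with different content but identical mean and variance yield a ratio of about 1 and are treated as unchanged.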

Another block-based technique is proposed by Shahraray [32]. The frame is divided into 12 non-overlapping blocks. For each of them the best match is found in the respective neighborhoods in the previous image based on image intensity values.

A non-linear order statistics filter is used to combine the match values, i.e. the weight of a match value in Eq. (3) will depend on its order in the match value list. Thus, the effect of camera and object movements is further suppressed. The author claims that such a similarity measure of two images is more consistent with human judgement.

Both cuts and gradual transitions are detected. Cuts are found using thresholds, as in the other approaches discussed, while gradual transitions are detected by identifying a sustained low-level increase in match values.

Xiong et al. [41] describe a method they call net comparison, which attempts to detect cuts by inspecting only part of the image. It is shown that the error will be low enough if less than half of the so-called base windows (non-overlapping square blocks, Fig. 5) are checked. Under an assumption about the largest movement between two images, the size of the windows can be chosen large enough to be indifferent to a non-break change and small enough to contain as much spatial information as possible. Base windows are compared using the difference between the mean values of their gray-level or color values. If this difference is larger than a threshold, the region is considered changed. When the number of changed windows is greater than another threshold, a cut is declared. The experiments demonstrated that the approach is faster and more accurate than pixel pair-wise, likelihood and local histogram methods. In their subsequent paper [40], the idea of subsampling the video in space is further extended to subsampling in both space and time.

The new Step-variable algorithm detects both abrupt and gradual transitions by comparing frames $i$ and $j$, where $j = i + myStep$. If no significant change is found between them, the algorithm moves half a step forward and the next comparison is between $i + myStep/2$ and $j + myStep/2$. Otherwise, binary search is used to locate the change. If $i$ and $j$ are successive and their difference is bigger than a threshold, a cut is declared. Otherwise, edge differences between the two frames are compared against another threshold to check for a gradual transition.

Obviously, the performance depends on the proper setting of $myStep$: large steps are efficient but increase the number of false alarms, while steps that are too small may result in missed gradual transitions. In addition, the approach is very sensitive to object and camera motion.

2.1.3. Histogram comparison

A further step towards reducing sensitivity to camera and object movements can be made by comparing the histograms of successive images.

The idea behind histogram-based approaches is that two frames with unchanging background and unchanging (although moving) objects will have little difference in their histograms. In addition, histograms are invariant to image rotation and change slowly under variations of viewing angle and scale [35]. As a disadvantage one can note that two images with similar histograms may have completely different content. However, the probability of such events is low; moreover, techniques for dealing with this problem have already been proposed in [28].


A gray level (color) histogram of a frame $i$ is an $n$-dimensional vector $H_i(j)$, $j = 1, \ldots, n$, where $n$ is the number of gray levels (colors) and $H_i(j)$ is the number of pixels from the frame $i$ with gray level (color) $j$.

Global histogram comparison. The simplest approach uses an adaptation of the metrics from Eq. (1): instead of intensity values, gray level histograms are compared [25,39,45]. A cut is declared if the sum of absolute histogram differences between two successive frames $D(i, i+1)$ is greater than a threshold $T$,

$$D(i, i+1) = \sum_{j=1}^{n} \left| H_i(j) - H_{i+1}(j) \right|, \quad (6)$$

where $H_i(j)$ is the histogram value for the gray level $j$ in the frame $i$, $j$ is the gray value and $n$ is the total number of gray levels.

Another simple and very effective approach is to compare color histograms. Zhang et al. [45] apply Eq. (6) where $j$, instead of gray levels, denotes a code value derived from the three color intensities of a pixel. In order to reduce the bin number (3 colors × 8 bits create histograms with $2^{24}$ bins), only the upper two bits of each color intensity are used to compose the color code. The comparison of the resulting 64 bins has been shown to give sufficient accuracy.
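The 64-bin color code and the bin-to-bin difference of Eq. (6) can be sketched as follows. The snippet assumes 8-bit RGB pixels stored as `(r, g, b)` tuples; the order in which the three 2-bit fields are packed into the code is an arbitrary choice made for this sketch, not something specified in [45].

```python
def color_code(r, g, b):
    """6-bit code built from the upper two bits of each 8-bit color component."""
    return ((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6)


def color_histogram(frame):
    """64-bin histogram of color codes for a frame given as rows of (r, g, b)."""
    hist = [0] * 64
    for row in frame:
        for r, g, b in row:
            hist[color_code(r, g, b)] += 1
    return hist


def histogram_difference(hist_a, hist_b):
    """Sum of absolute bin-to-bin differences, as in Eq. (6)."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


red = [[(255, 0, 0), (255, 0, 0)]]
dark_red = [[(250, 10, 10), (200, 40, 40)]]   # falls into the same coarse bins as red
blue = [[(0, 0, 255), (0, 0, 255)]]
print(histogram_difference(color_histogram(red), color_histogram(dark_red)))
print(histogram_difference(color_histogram(red), color_histogram(blue)))
```

Keeping only the upper two bits makes small color variations fall into the same bin, which is why the first comparison reports no difference.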

To enhance the difference between two frames across a cut, several authors [25] propose the use of the $\chi^2$ test to compare the (color) histograms $H_i(j)$ and $H_{i+1}(j)$ of the two successive frames $i$ and $i+1$,

$$D(i, i+1) = \sum_{j=1}^{n} \frac{\left| H_i(j) - H_{i+1}(j) \right|^2}{H_{i+1}(j)}. \quad (7)$$

When the difference is larger than a given threshold $T$, a cut is declared. However, experimental results reported in [45] show that the $\chi^2$ test not only enhances the difference between two frames across a cut but also increases the difference due to camera and object movements. Hence, the overall performance is not necessarily better than the linear histogram comparison represented in Eq. (6). In addition, the $\chi^2$ statistic requires more computational time.
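A small Python sketch can make the comparison between the linear measure of Eq. (6) and the $\chi^2$ test concrete. It assumes the common squared-difference form of the test, normalized by the histogram of the second frame; skipping empty bins in the denominator is a choice made for this sketch and is not taken from [25] or [45].

```python
def linear_difference(hist_a, hist_b):
    """Sum of absolute bin differences, as in Eq. (6)."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def chi_square_difference(hist_a, hist_b):
    """Squared bin differences normalized by the second histogram."""
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if b > 0:  # assumption: bins that are empty in the second frame are skipped
            total += (a - b) ** 2 / b
    return total


hist_a = [8, 2]
hist_b = [2, 8]
print(linear_difference(hist_a, hist_b), chi_square_difference(hist_a, hist_b))
```

Because the differences are squared, large bin changes are amplified regardless of whether they come from a cut or from motion, which is consistent with the observation reported in [45].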

Gargi et al. [12] evaluate the performance of three histogram based methods using six different color coordinate systems: RGB, HSV, YIQ, L*a*b*, L*u*v* and Munsell. The RGB histogram of a frame is computed as three sets of 256 bins. The other five histograms are represented as a 2-dimensional distribution over the two non-intensity based dimensions of the color spaces, namely: H and S for HSV, I and Q for YIQ, a* and b* for L*a*b*, u* and v* for L*u*v*, and the hue and chroma components for the Munsell space.

The number of bins is 1600 (40 × 40) for the L*a*b*, L*u*v* and YIQ histograms and 1800 (60 hues × 30 saturations/chromas) for the HSV and Munsell space histograms. The difference functions used to compare histograms of two consecutive frames are defined as follows:

• bin-to-bin differences as in Eq. (6)

• histogram intersection:

$$D(i, i+1) = 1 - \mathrm{Intersection}(H_i, H_{i+1}) = 1 - \frac{\sum_{j=1}^{n} \min\left( H_i(j), H_{i+1}(j) \right)}{\sum_{j=1}^{n} \max\left( H_i(j), H_{i+1}(j) \right)}. \quad (8)$$

Note that for two identical histograms the intersection is 1 and the difference 0, while for two frames which do not share even a single pixel of the same color (bin), the difference is 1.

• weighted bin differences:

$$D(i, i+1) = \sum_{j=1}^{n} \sum_{k \in N(j)} W(k) \cdot \left( H_i(j) - H_{i+1}(k) \right), \quad (9)$$

where $N(j)$ is a neighborhood of bin $j$ and $W(k)$ is the weight value assigned to that neighbor. A 3 × 3 or a 3-bin neighborhood is used in the case of 2-dimensional and 1-dimensional histograms, respectively.
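The histogram intersection measure of Eq. (8) is easy to check numerically. The sketch below assumes histograms are plain Python lists of bin counts; returning 0 for two empty histograms is a convention chosen here and is not discussed in [12].

```python
def intersection_difference(hist_a, hist_b):
    """One minus the ratio of summed bin minima to summed bin maxima."""
    overlap = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    union = sum(max(a, b) for a, b in zip(hist_a, hist_b))
    if union == 0:
        return 0.0  # assumption: two empty histograms are treated as identical
    return 1.0 - overlap / union


print(intersection_difference([4, 0], [4, 0]))  # identical histograms
print(intersection_difference([4, 0], [0, 4]))  # no shared bin
print(intersection_difference([3, 1], [1, 3]))  # partial overlap
```

The first two calls reproduce the limiting cases stated above: a difference of 0 for identical histograms and 1 for histograms that share no bin.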

It is found that in terms of overall classification accuracy the YIQ, L*a*b* and Munsell color coordinate spaces perform well, followed by HSV, L*u*v* and RGB. In terms of computational cost of conversion from RGB, HSV and YIQ are the least expensive, followed by L*a*b*, L*u*v* and the Munsell space.

So far only histogram comparison techniques for cut detection have been presented. They are based on the fact that there is a big difference between the frames across a cut that results in a high peak in the histogram comparison and can be easily detected using one threshold. However, such one-threshold based approaches are not suitable for detecting gradual transitions. Although during a gradual transition the frame to frame differences are usually higher than those within a shot, they are much smaller than the differences in the case of a cut and cannot be detected with the same threshold. On the other hand, object and camera motions might entail bigger differences than the gradual transition. Hence, lowering the threshold will increase the number of false positives. Below we review a simple and effective two-threshold technique for gradual transition recognition.

Fig. 6. Twin comparison: (a) consecutive and (b) accumulated histogram differences.

The twin-comparison method [45] takes into account the cumulative differences between frames of the gradual transition. In the first pass a high threshold $T_h$ is used to detect cuts as shown in Fig. 6(a). In the second pass a lower threshold $T_l$ is employed to detect the potential starting frame $F_s$ of a gradual transition. $F_s$ is then compared to subsequent frames (Fig. 6(b)). This is called an accumulated comparison, as during a gradual transition this difference value increases. The end frame $F_e$ of the transition is detected when the difference between consecutive frames decreases to less than $T_l$, while the accumulated comparison has increased to a value higher than $T_h$. If the consecutive difference falls below $T_l$ before the accumulated difference exceeds $T_h$, then the potential start frame $F_s$ is dropped and the search continues for other gradual transitions. It was found, however, that there are some gradual transitions during which the consecutive difference falls below the lower threshold. This problem can be easily solved by setting a tolerance value that allows a certain number of consecutive frames with low difference values before rejecting the transition candidate. As can be seen, the twin-comparison detects both abrupt and gradual transitions at the same time. Boreczky and Rowe [6] compared several temporal video segmentation techniques on real video sequences and found that twin comparison is a simple algorithm that works very well.

Local histogram comparison. As already discussed, histogram-based approaches are simple and more robust to object and camera movements, but they ignore the spatial information and, therefore, fail when two different images have similar histograms. On the other hand, block-based comparison methods make use of spatial information. They typically perform better than pair-wise pixel comparison but are still sensitive to camera and object motion and are also computationally expensive. By integrating the two paradigms, false alarms due to camera and object movement can be reduced while enough spatial information is retained to produce more accurate results.
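Returning to the twin-comparison method described above, its control flow can be sketched in Python as follows. This is a simplified single-pass illustration, not the implementation of [45]: the names `t_high`, `t_low` and `tolerance` are placeholders for the two thresholds and the tolerance value, histograms are plain lists of bin counts, and a candidate that is still open at the end of the sequence is simply discarded.

```python
def hist_diff(hist_a, hist_b):
    """Sum of absolute bin differences, as in Eq. (6)."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def twin_comparison(histograms, t_high, t_low, tolerance=0):
    """Return (cuts, gradual): cut positions i (between frames i and i+1)
    and (start, end) frame indices of gradual transitions."""
    cuts, gradual = [], []
    start = None   # potential start frame of a gradual transition
    low_run = 0    # low-difference frames tolerated inside a candidate
    for i in range(len(histograms) - 1):
        d = hist_diff(histograms[i], histograms[i + 1])
        if d > t_high:               # abrupt change: cut, drop any candidate
            cuts.append(i)
            start = None
            continue
        if start is None:
            if d > t_low:            # possible start of a gradual transition
                start, low_run = i, 0
            continue
        if d > t_low:                # transition still in progress
            low_run = 0
            continue
        # consecutive difference fell below t_low: check the accumulated one
        if hist_diff(histograms[start], histograms[i]) > t_high:
            gradual.append((start, i))
            start = None
        else:
            low_run += 1
            if low_run > tolerance:  # reject the candidate
                start = None
    return cuts, gradual


# Toy sequence: a five-step dissolve followed by a cut (two-bin histograms).
sequence = [[100, 0], [100, 0], [80, 20], [60, 40], [40, 60],
            [20, 80], [0, 100], [0, 100], [100, 0], [100, 0]]
print(twin_comparison(sequence, t_high=150, t_low=30))
```

In the toy sequence no consecutive difference during the dissolve reaches the high threshold, yet the accumulated difference from the start frame does, which is exactly the situation the second threshold is meant to capture.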

The frame-to-frame difference of frame $i$ and frame $i+1$ is computed as

$$D(i, i+1) = \sum_{k=1}^{b} DP(i, i+1, k), \qquad DP(i, i+1, k) = \sum_{j=1}^{n} \left| H_i(j, k) - H_{i+1}(j, k) \right|, \quad (10)$$

where $H_i(j, k)$ denotes the histogram value at gray level $j$ for the region (block) $k$ and $b$ is the total number of blocks.

For example, Nagasaka and Tanaka [25] compare several statistics based on gray-level and color pixel differences and histogram comparisons. The best results were obtained by breaking the image into 16 equal-sized regions, using the $\chi^2$ test on color histograms for these regions and discarding the largest differences to reduce the effects of noise, object and camera movements.
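The following Python sketch illustrates the local histogram difference of Eq. (10), with an optional step that discards the largest block differences in the spirit of Nagasaka and Tanaka. It is a simplification: gray-level histograms and absolute differences are used instead of the $\chi^2$ test on color histograms, frame sides are assumed divisible by `blocks_per_side`, and all names are illustrative.

```python
def block_histograms(frame, blocks_per_side, levels):
    """Gray-level histograms of the blocks of a regular square grid."""
    height, width = len(frame), len(frame[0])
    bh, bw = height // blocks_per_side, width // blocks_per_side
    hists = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            hist = [0] * levels
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    hist[frame[y][x]] += 1
            hists.append(hist)
    return hists


def local_histogram_difference(frame_a, frame_b, blocks_per_side, levels, discard=0):
    """Sum of per-block histogram differences, dropping the `discard` largest."""
    partial = sorted(
        sum(abs(a - b) for a, b in zip(ha, hb))
        for ha, hb in zip(
            block_histograms(frame_a, blocks_per_side, levels),
            block_histograms(frame_b, blocks_per_side, levels),
        )
    )
    kept = partial[: len(partial) - discard] if discard else partial
    return sum(kept)


# Two frames with identical global histograms but swapped content.
frame_a = [[0, 1], [1, 0]]
frame_b = [[1, 0], [0, 1]]
print(local_histogram_difference(frame_a, frame_b, blocks_per_side=2, levels=2))
print(local_histogram_difference(frame_a, frame_b, 2, 2, discard=1))
```

The two toy frames have the same global histogram, so a global comparison would report no change, while the block-wise comparison does.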

Another approach based on local histogram comparison is proposed by Swanberg et al. [36]. The partial difference $DP(i, i+1, k)$ is measured by comparing the color RGB histograms of the blocks using the following equation:

$$DP(i, i+1, k) = \sum_{c \in \{R, G, B\}} \sum_{l=1}^{n} \left( H_i^c(l) - H_{i+1}^c(l) \right)^2. \quad (11)$$

Then, Eq. (3) is applied where $c_k$ is $1/b$ for all $k$.

Lee and Ip [22] introduce a selective HSV histogram comparison algorithm. In order to reduce the frame-to-frame differences caused by changes in intensity or shade, image blocks are compared in the HSV (hue, saturation, value) color space. It is the use of hue that makes the algorithm insensitive to such changes, since hue is independent of saturation and intensity. However, as hue is unstable when the saturation or the value is very low, selective comparison is proposed. If a pixel contains rich color information (i.e. a high V and a high S), it is classified into a discrete color based on its hue (Hue), otherwise on its intensity value (Gray). The selective histograms $H_i^{hue}(h, k)$, $H_i^{gray}(g, k)$ and the frame-to-frame difference for the block $k$ with dimensionality $X \times Y$ are formulated as follows:

$$H_i^{hue}(h, k) = \sum_{x=1}^{X} \sum_{y=1}^{Y} I_i^{hue}(x, y, h), \qquad H_i^{gray}(g, k) = \sum_{x=1}^{X} \sum_{y=1}^{Y} I_i^{gray}(x, y, g),$$

$$I_i^{hue}(x, y, h) = \begin{cases} 1 & \text{if } S_i(x, y, h) > T_s \text{ and } V_i(x, y, h) > T_v, \\ 0 & \text{otherwise,} \end{cases}$$

$$I_i^{gray}(x, y, g) = \begin{cases} 1 & \text{if } S_i(x, y, g) \le T_s \text{ or } V_i(x, y, g) \le T_v, \\ 0 & \text{otherwise,} \end{cases}$$

$$D(i, i+1, k) = \sum_{h=1}^{N} \left| H_i^{hue}(h, k) - H_{i+1}^{hue}(h, k) \right| + \sum_{g=1}^{M} \left| H_i^{gray}(g, k) - H_{i+1}^{gray}(g, k) \right|, \quad (12)$$

where $h$ and $g$ are indexes for the hue and gray levels, respectively; $T_s$ and $T_v$ are thresholds and $x$, $y$ are pixel coordinates.

To further improve the algorithm by increasing the differences across a cut, local histogram comparison is performed. It is shown that the algorithm outperforms both histogram (gray level global and local) and pixel differences based approaches. However, none of the algorithms gives satisfactory performance on very dark video images.

2.1.4. Clustering-based temporal video segmentation

The approaches discussed so far rely on suitable thresholding of similarities between successive frames. However, the thresholds are typically highly sensitive to the type of input video. This drawback is overcome in [13] by the application of an unsupervised clustering algorithm. More specifically, temporal video segmentation is viewed as a 2-class clustering problem ("scene change" and "no scene change") and the well-known K-means algorithm [27] is used to cluster frame dissimilarities. Then the frames from the cluster "scene change" which are temporally adjacent are labeled as belonging to a gradual transition and the other frames from this cluster are considered as cuts. Two similarity measures based on color histograms were used: the $\chi^2$ statistic and the histogram difference defined in Eq. (6), both in the RGB and YUV color spaces. The experiments show that $\chi^2$-YUV detects the largest number of correct transitions, but the histogram difference-YUV is the best choice in terms of overall performance (i.e. number of false alarms and correct detections). As a limitation we can note that the approach is not able to recognize the type of the gradual transitions. The main advantage of the clustering-based segmentation is that it is a generic technique that not only eliminates the need for threshold setting but also allows multiple features to be used simultaneously to improve the performance. For example, in their subsequent work Ferman and Tekalp [10] incorporate two features in the clustering method: histogram difference and pair-wise pixel comparison. It was found that when filtered these features supplement one another, which results in both high recall and precision. A technique for clustering-based temporal segmentation on-the-fly was introduced as well.
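The clustering idea can be sketched with a one-dimensional, two-class K-means over frame dissimilarities. The Python snippet below is an illustration under simplifying assumptions (a single scalar feature per frame pair, centroids initialized at the minimum and maximum value, a non-empty input list); it is not the procedure of [13], and the function names are hypothetical.

```python
def two_means(values, iterations=100):
    """Two-class K-means on scalars; returns True for the high ('change') cluster."""
    low, high = min(values), max(values)
    labels = [False] * len(values)
    for _ in range(iterations):
        labels = [abs(v - high) < abs(v - low) for v in values]
        high_members = [v for v, is_high in zip(values, labels) if is_high]
        low_members = [v for v, is_high in zip(values, labels) if not is_high]
        new_low = sum(low_members) / len(low_members) if low_members else low
        new_high = sum(high_members) / len(high_members) if high_members else high
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return labels


def label_transitions(dissimilarities):
    """Isolated 'change' positions become cuts, adjacent runs become gradual transitions."""
    changed = two_means(dissimilarities)
    cuts, gradual = [], []
    i = 0
    while i < len(changed):
        if not changed[i]:
            i += 1
            continue
        j = i
        while j + 1 < len(changed) and changed[j + 1]:
            j += 1
        if j == i:
            cuts.append(i)
        else:
            gradual.append((i, j))
        i = j + 1
    return cuts, gradual


print(label_transitions([1, 2, 1, 90, 1, 2, 60, 65, 62, 1]))
```

No threshold is supplied by the user; the boundary between the two clusters is derived from the data, which is the property highlighted above.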

2.1.5. Feature based temporal video segmentation

An interesting approach for temporal video segmentation based on features is described by Zabih et al. [44]. It involves analyzing intensity edges between consecutive frames. During a cut or a dissolve, new intensity edges appear far from the locations of the old edges. Similarly, old edges disappear far from the location of new edges. Thus, by counting the entering and exiting edge pixels, cuts, fades and dissolves are detected and classified.

To obtain better results in case of object and camera movements, an algorithm for motion compensation is also included. It first estimates the global motion between frames, which is then used to align the frames before detecting entering and exiting edge pixels. However, this technique is not able to handle multiple rapidly moving objects. As the authors have pointed out, another weakness of the approach is the false positives due to the limitations of the edge detection method. In particular, rapid changes in the overall shot brightness, and very dark or very light frames, may cause false positives.

2.1.6. Model driven temporal video segmentation

The video segmentation techniques presented so far are sometimes referred to as data driven, bottom-up approaches [14]. They address the problem from a data analysis point of view. It is also possible to apply top-down algorithms that are based on mathematical models of video data. Such approaches allow a systematic analysis of the problem and the use of several domain-specific constraints that might improve the efficiency.

Hampapur et al. [14] present a shot boundaries identification approach based on the mathematical model of the video production process. This model was used as a basis for the classification of the video edit types (cuts, fades, dissolves).

For example, fades and dissolves are chromatic edits and can be modeled as

$$S(x, y, t) = S_1(x, y, t)\left(1 - \frac{t}{l_1}\right) + S_2(x, y, t)\left(1 - \frac{t}{l_2}\right), \quad (13)$$

where $S_1(x, y, t)$ and $S_2(x, y, t)$ are the two shots that are being edited, $S(x, y, t)$ is the edited shot and $l_1$, $l_2$ are the number of frames for each shot during the edit.

The taxonomy along with the models is then used to identify features that correspond to the different classes of shot boundaries. Finally, feature vectors are fed into a system for frame classification and temporal video segmentation. The approach is sensitive to camera and object motion.

Another model-based technique, called differential model of motion picture, is proposed by Aigrain and Joly [1]. It is based on the probabilistic distribution of differences in pixel values between two successive frames and combines the following factors: (1) a small amplitude additive zero-centered Gaussian noise that models camera, film, digitizer and other noises; (2) an intrashot change model for pixel change probability distribution resulting from object and camera motion, angle, focus and light change; (3) a shot transition model for the different types of abrupt and gradual transitions. The histogram of absolute values of pixel differences is computed and the number of pixels that change in value within a certain range determined by the models is counted. Then shot transitions are detected by examining the resulting integer sequences. Experiments show 94-100% accuracy for cuts and 80% for gradual transitions detection.

Yu et al. [43] present an approach for gradual transitions detection based on a model of intensity changes during fade out, fade in and dissolve. In the first pass, cuts are detected using histogram comparison. The gradual transitions are then detected by examining the frames between the cuts using the proposed model of their characteristics.

For example, it was found that the number of edge pixels has a local minimum during a gradual transition. However, as this feature exhibits the same behavior in the case of zoom and pan, additional characteristics of the fades and dissolves need to be used for their detection. During a fade, the beginning or end image is a constant image. Hence the number of edge pixels will be close to zero. Furthermore, the number of edge pixels gradually increases going away from the minimum on either side. In order to distinguish dissolves, the so-called double chromatic difference curve is examined. It is based on the idea that the frames of a dissolve can be recovered using the beginning and end frames. The approach has low computational requirements but works under the assumption of small object movement.

Boreczky and Wilcox [7] use hidden Markov models (HMM) for temporal video segmentation.


Table 1

Six groups of approaches for temporal video segmentation in compressed domain based on the information used

Information used     1   2   3   4   5   6
DCT coefficients     x           x
DC terms                 x   x
MB coding mode               x   x   x   x
MVs                          x   x   x
Bit-rate                                 x

Separate states are used to model shot, cut, fade, dissolve, pan and zoom. The arcs between states model the allowable progressions of states. For example, from the shot state it is possible to go to any of the transition states, but from a transition state it is only possible to return to a shot state. Similarly, the pan and zoom states can only be reached from the shot state, since they are subsets of the shot. The arcs from a state to itself model the length of time the video is in that particular state.

Three different types of features (image, audio and motion) are used: (1) a standard gray-level histogram distance between two adjacent frames; (2) an audio distance based on the acoustic difference in intervals just before and just after the frames and (3) an estimate of object motion between the two frames. The parameters of the HMM, namely the transition probabilities associated with the arcs and the probability distributions of the features associated with the states, are learned by training with the Baum-Welch algorithm. Training data consists of feature vectors computed for a collection of video and labeled as one of the following classes: shot, cut, fade, dissolve, pan and zoom. Once the parameters are trained, segmenting the video is performed using the Viterbi algorithm, a standard technique for recognition in HMM.

Thus, thresholds are not required as the parameters are learned automatically. Another advantage of the approach is that the HMM framework allows any number of features to be included in a feature vector. The algorithm was tested on different video databases and has been shown to improve the accuracy of temporal video segmentation in comparison to the standard threshold-based approaches.
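The decoding step can be illustrated with a compact Viterbi implementation in Python. This is a toy sketch only: it uses two states instead of six, a single quantized observation ("low" or "high" histogram distance) instead of continuous image, audio and motion features, and hand-picked probabilities rather than parameters learned with Baum-Welch. None of these numbers come from [7].

```python
import math


def log(p):
    """Natural logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state sequence for the observations (log-domain Viterbi)."""
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model: a cut must be followed by a return to the shot state.
states = ["shot", "cut"]
log_start = {"shot": log(0.99), "cut": log(0.01)}
log_trans = {"shot": {"shot": log(0.9), "cut": log(0.1)},
             "cut": {"shot": log(1.0), "cut": log(0.0)}}
log_emit = {"shot": {"low": log(0.95), "high": log(0.05)},
            "cut": {"low": log(0.05), "high": log(0.95)}}
print(viterbi(["low", "low", "high", "low", "low"], states,
              log_start, log_trans, log_emit))
```

The zero probability on the arc from the cut state to itself encodes the constraint described above that a transition state can only be followed by a shot state.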

2.2. Temporal video segmentation in MPEG compressed domain

The previous approaches for video segmentation process uncompressed video. As nowadays video is increasingly stored and moved in compressed format (e.g. MPEG), it is highly desirable to develop methods that can operate directly on the encoded stream. Working in the compressed domain offers the following advantages. First, by not having to perform decoding/re-encoding, computational complexity is reduced and savings on decompression time and decompression storage are obtained. Second, operations are faster due to the lower data rate of compressed video. Last but not least, the encoded video stream already contains a rich set of pre-computed features, such as motion vectors (MVs) and block averages, that are suitable for temporal video segmentation.

Several algorithms for temporal video segmentation in the compressed domain have been reported. According to the type of information used (see Table 1), they can be divided into six non-overlapping groups: segmentation based on (1) DCT coefficients; (2) DC terms; (3) DC terms, macroblock (MB) coding mode and MVs; (4) DCT coefficients, MB coding mode and MVs; (5) MB coding mode and MVs and (6) MB coding mode and bit-rate information. Before reviewing each of them, we present a brief description of the fundamentals of the MPEG compression standard.

2.2.1. MPEG stream

The Moving Picture Experts Group (MPEG) standard is the most widely accepted international standard for digital video compression. It uses two basic techniques: MB-based motion compensation to reduce temporal redundancy and transform domain block-based compression to capture spatial redundancy. An MPEG stream consists of three types of pictures (I, P and B), which are combined in a repetitive pattern called group of pictures (GOP). Fig. 7 shows a typical GOP and the predictive relationships between the different types of frames.


Fig. 8. Intra coding.

Fig. 7. Typical GOP and predictive relationships between I, P and B pictures.

Intra (I) frames provide random access points into the compressed data and are coded using only information present in the picture itself by Discrete Cosine Transform (DCT), Quantization (Q), Run Length Encoding (RLE), and Huffman entropy coding, see Fig. 8. The first DCT coefficient is called DC term and is 8 times the average intensity of the respective block.
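The factor of 8 follows from the normalization of the 8×8 DCT. The following sketch checks it numerically, assuming the orthonormal 2-D DCT-II; the function name `dct2_coefficient` is introduced here for illustration only.

```python
import math

def dct2_coefficient(block, u, v):
    """Return the (u, v) coefficient of the orthonormal 8x8 2-D DCT-II of `block`."""
    n = 8
    cu = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
    cv = math.sqrt(1 / n) if v == 0 else math.sqrt(2 / n)
    total = 0.0
    for x in range(n):
        for y in range(n):
            total += (block[x][y]
                      * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                      * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
    return cu * cv * total

def block_mean(block):
    """Average intensity of an 8x8 block."""
    return sum(sum(row) for row in block) / 64.0
```

For u = v = 0 both cosine factors equal 1 and the scale factor is 1/8, so the DC term is the sum of the 64 pixels divided by 8, i.e. 8 times their mean.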

P (predicted) frames are coded with forward motion compensation using the nearest previous reference (I or P) pictures. Bi-directional (B) pictures are also motion compensated, this time with respect to both past and future reference frames. In the case of motion compensation, for each 16×16 MB of the current frame the encoder finds the best matching block in the respective reference frame(s), calculates and DCT-encodes the residual error and also transmits one or two MVs, see Figs. 9 and 10. During the encoding process a test is made on each MB of a P or B frame to see if it is more expensive to use motion compensation or intra coding. The latter occurs when the current frame does not have much in common with the reference frame(s). As a result each MB of a P frame could be coded either intra or forward while for each MB of a B frame there are four possibilities: intra, forward, backward or interpolated. For more information about MPEG see [16].

2.2.2. Temporal video segmentation based on DCT coefficients

The pioneering work on video parsing directly in the compressed domain was conducted by Arman et al. [5], who proposed a technique for cut detection based on the DCT coefficients of I frames. For each frame a subset of the DCT coefficients of a subset of the blocks is selected in order to construct a vector $V_i = \{c_1, c_2, c_3, \ldots\}$. $V_i$ represents the frame $i$ from the video sequence in the DCT space. The normalized inner product is then used to find the difference between frames $i$ and $i+\varphi$,

$$D(i, i+\varphi) = \frac{V_i \cdot V_{i+\varphi}}{|V_i|\,|V_{i+\varphi}|}. \qquad (14)$$

A cut is detected if $1 - |D(i, i+\varphi)| > T_1$, where $T_1$ is a threshold. In order to reduce false positives due to camera and object motion, video cuts are examined more closely using a second threshold $T_2$ ($0 < T_2 < T_1 < 1$). If $T_2 < 1 - |D(i, i+\varphi)| < T_1$, the two frames are decompressed and examined by comparing their color histograms.
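The two-threshold decision can be sketched as follows. The threshold values and the function names are illustrative assumptions, not values reported by Arman et al.; the vectors stand for the selected DCT coefficients of two I frames.

```python
import math

def normalized_inner_product(v1, v2):
    """Normalized inner product of two coefficient vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def classify_frame_pair(v1, v2, t1=0.3, t2=0.1):
    """Return 'cut', 'check_histograms' or 'no_cut' for two DCT-space vectors.

    t1 and t2 are hypothetical thresholds satisfying 0 < t2 < t1 < 1.
    """
    diff = 1.0 - abs(normalized_inner_product(v1, v2))
    if diff > t1:
        return "cut"
    if diff > t2:
        # Ambiguous range: the frames would be decompressed and their
        # color histograms compared.
        return "check_histograms"
    return "no_cut"
```

Only frame pairs that fall between the two thresholds incur the cost of decompression, which is what keeps the method fast.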

Fig. 9. Forward prediction for P frames.

Fig. 10. Interpolated prediction for B frames.

Zhang et al. [46] apply a pair-wise comparison technique to the DCT coefficients of corresponding blocks of video frames. The difference metric is similar to pixel comparisons [25,45], see Section 2.1.1. More specifically, the difference of block $l$ from two frames which are $\varphi$ frames apart is given by

$$\mathit{DP}(i, i+\varphi, l) = \frac{1}{64} \sum_{k=1}^{64} \frac{|c_{l,k}(i) - c_{l,k}(i+\varphi)|}{\max[c_{l,k}(i),\, c_{l,k}(i+\varphi)]} > T_1, \qquad (15)$$

where $c_{l,k}(i)$ is the DCT coefficient of block $l$ in the frame $i$, $k = 1, 2, \ldots, 64$ and $l$ depends on the size of the frame.

If the difference exceeds a given threshold $T_1$, the block $l$ is considered to be changed. If the number of changed blocks is larger than another threshold $T_2$, a transition between the two frames is declared. The pair-wise comparison requires far less computation than the difference metric used by Arman. The processing time can be further reduced by applying Arman's method of using only a subset of coefficients and blocks.
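A minimal sketch of the pair-wise block comparison is given below. Frames are assumed to be lists of blocks, each block a list of 64 DCT coefficients. As an assumption that goes beyond Eq. (15), the denominator uses absolute values so that negative coefficients do not produce negative terms, and pairs of zero coefficients are skipped.

```python
def block_changed(coeffs_a, coeffs_b, t1):
    """Return True if the averaged relative coefficient difference exceeds t1."""
    total = 0.0
    for ca, cb in zip(coeffs_a, coeffs_b):
        denom = max(abs(ca), abs(cb))  # assumption: magnitudes in the denominator
        if denom > 0:
            total += abs(ca - cb) / denom
    return total / len(coeffs_a) > t1

def transition_detected(frame_a, frame_b, t1, t2):
    """Declare a transition if more than t2 corresponding blocks have changed."""
    changed = sum(
        1 for block_a, block_b in zip(frame_a, frame_b)
        if block_changed(block_a, block_b, t1)
    )
    return changed > t2
```

Restricting `frame_a` and `frame_b` to a subset of blocks, or each block to a subset of coefficients, gives the speed-up mentioned above.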

It should be noted that both of the above algorithms may be applied only to I frames of the MPEG compressed video, as they are the frames fully encoded with DCT coefficients. As a result, the processing time is significantly reduced but the temporal resolution is low. In addition, due to the loss of the resolution between the I frames, false positives are introduced and, hence, the classification accuracy decreases. Also, neither of the two algorithms can handle gradual transitions or false positives introduced by camera operations and object motion.

2.2.3. Temporal video segmentation based on DC terms

For temporal video segmentation in MPEG compressed domain the most natural solution is to use the DC terms as they are directly related to the pixel domain, possibly reconstructing them for P and B frames, when only DC terms of the residual errors are available. Then, analogous to the uncompressed domain methods, the changes between successive frames are evaluated by difference metrics and the decision is taken by complex thresholding.

For example, Yeo and Liu [42] propose a method where so called DC-images are created and compared. DC-images are spatially reduced versions of the original images: the (i, j) pixel of the DC-image is the average value of the (i, j) block of the image (Fig. 11).

As each DC term is a scaled version of the block's average value, DC-images can be constructed from DC terms. The DC terms of I frames are directly available in the MPEG stream while those of B and P frames are estimated using the MVs and DCT coefficients of previous I frames. It should be noted that the reconstruction technique is computationally very expensive: in order to compute the DC term of a reference frame ($DC_{ref}$) for each block, eight 8×8 matrix multiplications and 4 matrix summations are required. Then, the pixel differences of DC-images are compared and a sliding window is used to set the thresholds because the shot transition is a local activity.

Fig. 11. A full image (352×288 pixels) and its DC image (44×36 pixels).

Fig. 12. $g_n$ and $D^g_n(l, l+k)$ in the dissolve detection algorithm of Yeo and Liu.

In order to find a suitable similarity measure, the authors compare metrics based on pixel differences and color histograms. They confirm that when full images are compared, the first group of metrics is more sensitive to camera and object movements but computationally less expensive than the second one. However, when DC-images are compared, pixel-differences-based metrics give satisfactory results as DC-images are already smoothed versions of the corresponding full images. Hence, as in the pixel domain approaches (e.g. Eq. (1)), abrupt transitions are detected using a similarity measure based on the sum of absolute pixel differences of two consecutive frames (DC-images in this case):

$$D(l, l+1) = \sum_{i,j} |P_l(i, j) - P_{l+1}(i, j)|, \qquad (16)$$

where $l$ and $l+1$ are two consecutive DC-images and $P_l(i, j)$ is the intensity value of the pixel in the $l$th DC-image at the coordinates $(i, j)$.

In contrast to the previous methods for cut detection that apply global thresholds on the difference metrics, Yeo and Liu propose to use local thresholds as scene changes are local activities in the temporal domain. In this way false positives due to significant camera and object motions are reduced. More specifically, a sliding window is used to examine $m$ successive frame differences. A cut between frames $l$ and $l+1$ is declared if the following two conditions are satisfied: (1) $D(l, l+1)$ is the maximum within a symmetric sliding window of size $2m-1$ and (2) $D(l, l+1)$ is $n$ times the second largest maximum in the window. The second condition guards against false positives due to fast panning or zooming and camera flashes that typically manifest themselves as sequences of large differences or two consecutive peaks, respectively. The size of the sliding window $m$ is set to be smaller than the minimum duration between two transitions, while the values of $n$ typically range from 2 to 3.
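The two window conditions can be sketched as follows, assuming `diffs[l]` holds $D(l, l+1)$ for consecutive DC-images. The default values of `m` and `n` are illustrative choices within the ranges mentioned above, and the window is simply truncated at the ends of the sequence.

```python
def detect_cuts(diffs, m=5, n=2.5):
    """Return indices l where diffs[l] satisfies both sliding-window conditions."""
    cuts = []
    for l in range(len(diffs)):
        # Symmetric window of size 2m - 1 centred on l (truncated at the ends).
        lo = max(0, l - (m - 1))
        hi = min(len(diffs), l + m)
        window = diffs[lo:hi]
        others = window[:l - lo] + window[l - lo + 1:]
        if not others:
            continue
        second_largest = max(others)
        # (1) strict maximum of the window, (2) n times the second largest value.
        if diffs[l] > second_largest and diffs[l] >= n * second_largest:
            cuts.append(l)
    return cuts
```

An isolated peak is accepted, while two consecutive peaks of similar height, such as those produced by a camera flash, fail the second condition.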

Gradual transitions are detected by comparing each frame with the following $k$th frame where $k$ is larger than the number of frames in the gradual transition. A gradual transition $g_n$ in the form of a linear transition from $c_1$ to $c_2$ in the time interval $(a_1, a_2)$ is modeled as

$$g_n = \begin{cases} c_1, & n < a_1, \\ \dfrac{c_2 - c_1}{a_2 - a_1}(n - a_1) + c_1, & a_1 \le n < a_2, \\ c_2, & n \ge a_2. \end{cases} \qquad (17)$$

Then if $k > a_2 - a_1$, the difference between frames $l$ and $l+k$ from the transition $g_n$ will be

$$D^g_n(l, l+k) = \begin{cases} 0, & n < a_1 - k, \\ \dfrac{|c_2 - c_1|}{a_2 - a_1}\,[n - (a_1 - k)], & a_1 - k \le n < a_2 - k, \\ |c_2 - c_1|, & a_2 - k \le n < a_1, \\ -\dfrac{|c_2 - c_1|}{a_2 - a_1}\,(n - a_2), & a_1 \le n < a_2, \\ 0, & n \ge a_2. \end{cases} \qquad (18)$$

As $D^g_n(l, l+k)$ corresponds to a symmetric plateau with sloping sides (see Fig. 12), the goal of the gradual transition detection algorithm is to identify such plateau patterns. The algorithm of Yeo and Liu needs 11 parameters to be specified.
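The plateau shape can be reproduced numerically from the linear transition model. The sketch below is only an illustration with made-up parameters ($a_1 = 10$, $a_2 = 14$, $c_1 = 0$, $c_2 = 8$, $k = 6$); the function names are introduced here.

```python
def linear_transition(n, a1, a2, c1, c2):
    """Value at time n of a linear transition from c1 to c2 over [a1, a2)."""
    if n < a1:
        return c1
    if n < a2:
        return (c2 - c1) / (a2 - a1) * (n - a1) + c1
    return c2

def plateau(a1, a2, c1, c2, k, start, stop):
    """Absolute difference between frames n and n + k for n in [start, stop)."""
    return [
        abs(linear_transition(n, a1, a2, c1, c2)
            - linear_transition(n + k, a1, a2, c1, c2))
        for n in range(start, stop)
    ]

# Made-up example: a 4-frame transition compared at a distance of k = 6 frames.
example = plateau(a1=10, a2=14, c1=0, c2=8, k=6, start=0, stop=16)
# example == [0, 0, 0, 0, 0, 2, 4, 6, 8, 8, 8, 6, 4, 2, 0, 0]
```

The sequence rises from $a_1 - k$, stays at $|c_2 - c_1|$ between $a_2 - k$ and $a_1$, and falls back to zero at $a_2$, which is the symmetric plateau the detector looks for.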

In [33] shots are detected by color histogram comparison of DC term images of consecutive frames. Such images are formed by the DC terms of the DCT coefficients for a frame. DC terms of I pictures are taken directly from the MPEG stream, while those for P and B frames are reconstructed by the following fast algorithm. First, the DC term of the reference image ($DC_{ref}$) is approximated using the weighted average of the DC terms of the blocks pointed by the MVs, Fig. 13:

$$DC_{ref} = \frac{1}{64} \sum_{a \in E} N_a\, DC_a, \qquad (19)$$

where $DC_a$ is the DC term of block $a$, $E$ is the collection of all blocks that are overlapped by the reference block and $N_a$ is the number of pixels in block $a$ that is overlapped by the reference block.

Fig. 13. DC term estimation in the method of Shen and Delp.

Fig. 14. Histogram difference diagram ((*) cut; (- - - -) dissolve).

Then, the approximated DC terms of the predicted pictures are added to the encoded DC terms of the difference images ($DC_{diff}$) in order to form the DC terms of P and B pictures,

$$DC = DC_{diff} + DC_{ref} \quad \text{(only forward or only backward prediction)},$$
$$DC = DC_{diff} + \tfrac{1}{2}\,(DC_{ref1} + DC_{ref2}) \quad \text{(interpolated prediction)}. \qquad (20)$$

In this way the computations are reduced to at most 4 scalar multiplications and 3 scalar summations for each block to determine $DC_{ref}$.
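The weighted-average estimate can be sketched as below. The helper `overlap_counts` and its displacement arguments are assumptions added for illustration: they give the number of pixels that an 8×8 reference block, shifted by (dx, dy) pixels relative to the block grid, shares with each of the four blocks it overlaps.

```python
def overlap_counts(dx, dy):
    """Pixel counts shared with the four overlapped blocks, for 0 <= dx, dy <= 8.

    Order: top-left, top-right, bottom-left, bottom-right.
    """
    return [(8 - dx) * (8 - dy), dx * (8 - dy), (8 - dx) * dy, dx * dy]

def estimate_reference_dc(dc_terms, counts):
    """Weighted average of the DC terms of the overlapped blocks, as in Eq. (19)."""
    return sum(n_a * dc_a for n_a, dc_a in zip(counts, dc_terms)) / 64.0

def reconstruct_dc(dc_diff, dc_refs):
    """DC term of a predicted block, as in Eq. (20).

    dc_refs holds one estimate (forward or backward prediction only)
    or two estimates (interpolated prediction, which averages them).
    """
    return dc_diff + sum(dc_refs) / len(dc_refs)
```

With at most four overlapped blocks, the estimate needs four multiplications and three additions, matching the operation count stated above.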

The histogram difference diagram is generated using the measure from Eq. (6) comparing DC term images. As can be seen from Fig. 14, a break is represented by a single sharp pulse and a dissolve entails a number of consecutive medium-height pulses. Cuts are detected using a static threshold.

For the recognition of gradual transitions, the histogram difference of the current frame is compared with the average of the histogram differences of the previous frames within a window. If this difference is $n$ times larger than the average value, a possible start of a gradual transition is marked. The same value of $n$ is used as a soft threshold for the following frames. The end of the transition is declared when the histogram difference is lower than the threshold. Since during a gradual transition not all of the histogram differences may be higher than the soft threshold, similarly to the twin comparison, several frames are allowed to have a lower difference as long as the majority of the frames in the transition have higher magnitude than the soft threshold.

As only the DC terms are used, the computation of the histograms is 64 times faster than that using the original pixel values. The approach is not able to distinguish rapid object movement from gradual transition. As a partial solution, a median filter (of size 3) is applied to smooth the histogram differences when detecting gradual transitions. There are 7 parameters that need to be specified.

An interesting extension of the previous approach is proposed by Taskiran and Delp [37]. After the DC term image sequence and the luminance histogram for each image are obtained, a two-dimensional feature vector is extracted from each pair of images. The first component is the dissimilarity measure based on the histogram intersection of the consecutive DC term images,

$$x_{i,1} = 1 - \mathrm{Intersection}(H_i, H_{i+1}) = 1 - \frac{\sum_{j=1}^{n} \min(H_i(j), H_{i+1}(j))}{\sum_{j=1}^{n} H_{i+1}(j)}, \qquad (21)$$

where $H_i(j)$ is the luminance histogram value for the bin $j$ in frame $i$ and $n$ is the number of bins used.

Note that the definition of the histogram intersection is slightly different from that used in [12, Section].

The second feature is the absolute value of the difference of standard deviations $\sigma$ for the luminance component of the DC term images, i.e. $x_{i,2} = |\sigma_i - \sigma_{i+1}|$. The so-called generalized sequence trace $d$ for a video stream composed of $n$ frames is defined as $d_i = \|x_i - x_{i+1}\|$, $i = 1, \ldots, n$.
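A minimal sketch of the two features and of the trace is given below. The norm used for the trace is assumed to be Euclidean, and the population standard deviation is used; the text above does not fix either choice.

```python
import math
import statistics

def histogram_dissimilarity(h_a, h_b):
    """One minus the histogram intersection, normalized by the second histogram."""
    return 1.0 - sum(min(a, b) for a, b in zip(h_a, h_b)) / sum(h_b)

def std_difference(lum_a, lum_b):
    """Absolute difference of the luminance standard deviations of two DC images."""
    return abs(statistics.pstdev(lum_a) - statistics.pstdev(lum_b))

def generalized_trace(features):
    """Euclidean distance between consecutive two-dimensional feature vectors."""
    return [
        math.hypot(a[0] - b[0], a[1] - b[1])
        for a, b in zip(features, features[1:])
    ]
```

Here `features` is the list of pairs built from `histogram_dissimilarity` and `std_difference` for successive image pairs; peaks in the returned trace are the candidates for shot boundaries.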


Fig. 15. Video shot detection scheme of Patel and Sethi.

These features are chosen not only because they are easy to extract. Combining histogram-based and pixel-based parameters makes sense as they compensate for each other's disadvantages. As discussed already, pixel-based techniques give false alarms in case of camera and object movements. On the other hand, histogram-based techniques are less sensitive to these effects but may miss a shot transition if the luminance distribution of the frames does not change significantly. It is shown that there are different types of peaks in the generalized trace plot: wide, narrow and middle, corresponding to a fade out followed by a fade in, cuts and dissolves, respectively. Then, in contrast to the other approaches that apply global or local thresholds to detect the shot boundaries, Taskiran and Delp pose the problem as a one-dimensional edge detection and apply a method based on mathematical morphology.

Patel and Sethi [29,30] use only the DC components of I frames. In [30] they compute the intensity histogram for the DC term images and compare them using three different statistics: Yakimovsky likelihood ratio, $\chi^2$ test and Kolmogorov-Smirnov statistics. The experiments show that the $\chi^2$ test gives satisfactory results and outperforms the other techniques. In their subsequent paper [29], Patel and Sethi compare local and global histograms of consecutive DC term images using the $\chi^2$ test, Fig. 15.

The local row and column histograms $X_i$ and $Y_j$ are defined as follows:

$$X_i = \frac{1}{M} \sum_{j=1}^{M} b(i, j), \qquad Y_j = \frac{1}{N} \sum_{i=1}^{N} b(i, j), \qquad (22)$$

where $b(i, j)$ is the DC term of block $(i, j)$, $i = 1, \ldots, N$, $j = 1, \ldots, M$. The outputs of the $\chi^2$ test are combined using majority and average comparison in order to detect abrupt and gradual transitions.
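The row and column projections of Eq. (22) and a histogram comparison can be sketched as follows. The particular form of the $\chi^2$ statistic used here, $\sum (a - b)^2 / (a + b)$, is a common choice and an assumption of this sketch; the exact statistic of Patel and Sethi is not given above.

```python
def row_col_histograms(dc):
    """Row means X_i and column means Y_j of an N x M array of DC terms."""
    n_rows, n_cols = len(dc), len(dc[0])
    rows = [sum(row) / n_cols for row in dc]
    cols = [sum(dc[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    return rows, cols

def chi_square(h_a, h_b):
    """Chi-square style distance between two histograms (empty bin pairs skipped)."""
    return sum(
        (a - b) ** 2 / (a + b)
        for a, b in zip(h_a, h_b)
        if a + b > 0
    )
```

Comparing the row, column and global histograms of consecutive I frames yields three distances, which can then be combined by a majority or an average rule as described above.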

As only I frames are used, the DC recovering is eliminated. However, the temporal resolution is low as in a typical GOP every 12th frame is an I frame and, hence, the exact shot boundaries cannot be labeled.

2.2.4. Temporal video segmentation based on DC terms and MB coding mode

Meng et al. [24] propose a shot boundary detection algorithm based on the DC terms and the type of MB coding, Fig. 16. DC components are reconstructed only for P frames. Gradual transitions are detected by calculating the variance $\sigma^2$ of the DC term sequence for I and P frames and looking for parabolic shapes in this curve. This is based on the fact that if gradual transitions are a linear mixture of two video sequences $f_1$ and $f_2$ with intensity variances $\sigma_1^2$ and $\sigma_2^2$, respectively, and are characterized by $f(t) = f_1(t)[1 - \alpha(t)] + f_2(t)\alpha(t)$ where $\alpha(t)$ is a linear parameter, then the shape of the variance curve is parabolic: $\sigma^2(t) = (\sigma_1^2 + \sigma_2^2)\alpha^2(t) - 2\sigma_1^2\alpha(t) + \sigma_1^2$. Cuts are detected by the computation of the following three ratios:

$$R_p = \frac{\mathit{intra}}{\mathit{forw}}, \qquad R_b = \frac{\mathit{back}}{\mathit{forw}}, \qquad R_f = \frac{\mathit{forw}}{\mathit{back}}, \qquad (23)$$

where intra, forw and back are the number of MBs in the current frame that are intra, forward and backward coded, respectively.

Fig. 16. Shot detection algorithm of Meng et al.

If there is a cut on a P frame, the encoder cannot use many MBs from the previous anchor frame for motion compensation and as a result many MBs will be coded intra. Hence, a suspected cut on a P frame is declared if $R_p$ peaks. On the other hand, if there is a cut on a B frame, the encoding will be mainly backward. Therefore, a suspected cut on a B frame is declared if there is a peak in $R_b$. An I frame is a suspected cut frame if two conditions are satisfied: (1) there is a peak in $|\Delta\sigma^2|$ for this frame and (2) the B frames before I have peaks in $R_f$. The first condition is based on the observation that the intensity variance of the frames during a shot is stable, while the second condition prevents false positives due to motion.
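The three ratios are simple functions of the macroblock counts of a frame. The sketch below assumes the counts have already been parsed from the stream; the handling of a zero denominator is a choice made for this illustration.

```python
def mb_ratios(intra, forw, back):
    """Return the ratios R_p, R_b and R_f from the MB counts of one frame."""
    def ratio(num, den):
        if den:
            return num / den
        # Assumption: a zero denominator gives an unbounded ratio, or 0 if num is 0 too.
        return float("inf") if num else 0.0

    return {
        "R_p": ratio(intra, forw),  # peaks at a cut on a P frame
        "R_b": ratio(back, forw),   # peaks at a cut on a B frame
        "R_f": ratio(forw, back),   # peaks in the B frames before a cut on an I frame
    }
```

For example, a P frame with 300 intra and 30 forward coded MBs gives $R_p = 10$, whereas a P frame inside a shot typically has few intra coded MBs and a ratio well below 1.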

This technique is relatively simple, requires minimal decoding and produces good accuracy. The total number of parameters needed to implement this algorithm is 7.

2.2.5. Temporal video segmentation based on DCT coefficients, MB coding mode and MVs

A very interesting two-pass approach is taken by Zhang et al. [47]. They first locate the regions of potential transitions, camera operations and object motion, applying the pair-wise DCT coefficients comparison of I frames (Eq. (15)) as in their previous approach (see Section 2.2.2). The goal of the second pass is to refine and confirm the break points detected by the first pass. By checking the number of MVs $M$ for the selected areas, the exact cut locations are detected. If $M$ denotes the number of MVs in P frames and the smaller of the numbers of forward and backward nonzero MVs in B frames, then $M < T$ (where $T$ is a threshold close to zero) is an effective indicator of a cut before or after the B and P frame. Gradual transitions are found by an adaptation of the twin comparison algorithm utilizing the DCT differences of I frames. By MV analysis (see Section 3.1 for more details), though using thresholds, false positives due to pan and zoom are detected and discriminated from gradual transitions.

Thus, the algorithm only uses information directly available in the MPEG stream. It offers high processing speed due to the multipass strategy, good accuracy and also detects false positives due to pan and zoom. However, the metric for cut detection yields false positives in the case of static frames. Also, the problem of how to distinguish object movements from gradual transitions is not addressed.

2.2.6. Temporal video segmentation based on MB coding mode and MVs

In [21] cuts, fades and dissolves are detected only using MVs from P and B frames and information about MB coding mode. The system follows a two-pass scheme and has a hybrid rule-based/neural structure. During the rough scan, peaks in the number of intra coded MBs in P frames are detected. They can be sharp (Fig. 17) or gradual with specific shape (Fig. 18) and are good indicators of abrupt and gradual transitions, respectively.

Fig. 17. Cuts: (a) video structure, (b) number of intra-coded MBs for P frames.

Fig. 18. Fade out, fade in, dissolve: (a) video structure, (b) number of intra-coded MBs for P frames.

The solution is then refined by a precise scan over the frames of the respective neighborhoods. The "simpler" boundaries (cuts and black fade edges) are recognized by the rule-based module, while the decisions for the "complex" ones (dissolves and non-black fade edges) are taken by the neural part. The precise scan also reveals cuts that remain hidden for the rough scan, e.g. the cuts on B and I frames in Fig. 17. The rules for the exact cut location are based on the number of backward and forward MBs while those for the detection of black fade edges use the number of interpolated and backward coded MBs. There is only one threshold in the rules that is easy to set and not sensitive to the type of video. The neural network module learns from pre-classified examples in the form of MV patterns corresponding to the following 6 classes: stationary, pan, zoom, object motion, tracking and dissolve. It is used to distinguish dissolves from object and camera movements, find the exact location of the "complex" boundaries of the gradual transition and further divide shots into sub-shots. For more details about the neural network see Section 3.3.

The approach is simple, fast, robust to camera operations and very accurate when detecting the exact locations of cuts, fades and simple dissolves. However, sometimes dissolves between busy sequences are recognized as object movement or their boundaries are not exactly determined.

2.2.7. Temporal video segmentation based on MB coding mode and bit-rate information

Although limited only to cut detection, a simple and effective approach is proposed in [9]. It only uses the bit-rate information at MB level and the number of various motion predicted MBs. A large change in bit-rate between two consecutive I or P frames indicates a cut between them. Similarly to [24], the number of backward predicted MBs is used for detecting cuts on B frames. Here, the ratio is calculated as $R_b = \mathit{back}/\mathit{mc}$, where back and mc are the number of backward and all motion compensated MBs in a B frame, respectively. The algorithm is able to locate the exact cut locations. It operates hierarchically by first locating a suspected cut between two I frames, then between the P frames of the GOP and finally (if necessary) by checking the B frames.

2.2.8. Comparison of algorithms for temporal video segmentation in compressed domain

In [11] the approaches of Arman et al. [5], Patel and Sethi [29], Meng et al. [24], Yeo and Liu [42] and Shen and Delp [33] are compared along several parameters: classification performance (recall and precision), full data use, ease of implementation, source effects. Ten MPEG video sequences containing more than 30 000 frames connected with 172 cuts and 38 gradual transitions are used as an evaluation database. It is found that the algorithm of Yeo and Liu and that of Shen and Delp perform best when detecting cuts. Although none of the approaches recognizes gradual transitions particularly well, the best performance is achieved by the last one. As the authors point out, the reason for the poor gradual transition detection is that the algorithms expect some sort of ideal curve (a plateau or a parabola) but the actual frame differences are noisy and either do not follow this ideal pattern or do not do this smoothly for the entire transition. Another interesting conclusion is that not processing all frame types (e.g., as in the first two methods) does decrease performance significantly. The algorithm of Yeo and Liu is found to be the easiest to implement as it specifies the parameter values and even some performance analysis is already carried out by the authors. The dependence of the two best performing algorithms on bit-rate variations is investigated, and they are shown to be robust to bit-rate changes except at very low rates. Finally, the dependence of the algorithm of Yeo and Liu on two different software encoder implementations is studied and significant performance differences are reported.

Fig. 19. MV patterns resulting from various camera operations.

3. Camera operation recognition

As stated previously, at the stage of temporal video segmentation gradual transitions have to be distinguished from false positives due to camera motion. In the context of content-based retrieval systems, camera operation recognition is also important for key frame selection, index extraction, construction of salient video stills and search narrowing. Historically, motion estimation was extensively studied in the field of computer vision and image coding and used for tracking of moving objects, recovering of object movement, and motion compensation coding. Below we review approaches for camera operation recognition related to shot partitioning and characterization.

3.1. Analysis of MVs

Camera movements exhibit specific patterns in the field of MVs, as shown in Fig. 19. Therefore, many approaches for camera operation recognition are based on the analysis of MV fields.

Zhang et al. [46] apply rules to detect pan/tilt and zoom in/zoom out. During a pan most of the MVs will be parallel to a modal vector that corresponds to the movement of the camera. This is expressed by the following inequality:

$$\sum_{b=1}^{N} |\theta_b - \theta_m| \le T, \qquad (24)$$

where $\theta_b$ is the direction of the MV for block $b$, $\theta_m$ is the direction of the modal vector, $N$ is the total number of blocks into which the frame is partitioned and $T$ is a threshold near zero.
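The pan test can be sketched as follows, assuming the MV directions are given in degrees. Two details are assumptions of this sketch rather than part of the rule above: the modal direction is approximated by the mean of the most populated 10-degree bin, and angle differences are wrapped so that 350° and 10° are 20° apart.

```python
from collections import Counter

def angular_difference(a, b):
    """Smallest absolute difference between two directions, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def modal_direction(directions, bin_width=10.0):
    """Approximate the modal direction as the mean of the most populated bin."""
    bins = Counter(int((d % 360.0) // bin_width) for d in directions)
    modal_bin = bins.most_common(1)[0][0]
    members = [d % 360.0 for d in directions
               if int((d % 360.0) // bin_width) == modal_bin]
    return sum(members) / len(members)

def is_pan(directions, threshold):
    """Return True if the summed deviation from the modal direction is small."""
    modal = modal_direction(directions)
    return sum(angular_difference(d, modal) for d in directions) <= threshold
```

Because the deviations are summed over all blocks, the threshold has to be chosen relative to the number of blocks in the frame.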

Table 2

MV patterns characterization

Camera operation   MV origin      MV magnitude
Still              No             Zero
Panning            Infinity       Constant
Tracking           Infinity       Changeable
Tilting            Infinity       Constant
Booming            Infinity       Changeable
Zooming            Image center   Constant
Dollying           Image center   Changeable

Fig. 20. Decision tree.

In the case of zooming, the field of MVs has a focus of expansion (zoom in) or a focus of contraction (zoom out). Zooming is determined on the basis of "peripheral vision", i.e. by comparing the vertical components $v_k$ of the MVs for the top and bottom rows of a frame, since during a zoom they have opposite signs. In addition, the horizontal components $u_k$ of the MVs for the left-most and right-most columns are analyzed in the same way. Mathematically these two conditions can be expressed in the following way:

$$|v_k^{top} - v_k^{bottom}| \ge \max(|v_k^{top}|, |v_k^{bottom}|),$$
$$|u_k^{left} - u_k^{right}| \ge \max(|u_k^{left}|, |u_k^{right}|).$$

When both conditions are satisfied, a zoom is declared.
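Each inequality holds exactly when the two compared components have opposite signs (or one of them is zero). A minimal sketch of the test is given below; requiring the conditions for every index $k$ and rejecting an all-zero field are assumptions of this illustration.

```python
def opposite_motion(a_values, b_values):
    """True if |a - b| >= max(|a|, |b|) holds for every pair of components."""
    return all(
        abs(a - b) >= max(abs(a), abs(b))
        for a, b in zip(a_values, b_values)
    )

def is_zoom(v_top, v_bottom, u_left, u_right):
    """Apply the vertical and horizontal zoom conditions to border MVs.

    v_top, v_bottom: vertical MV components of the top and bottom rows.
    u_left, u_right: horizontal MV components of the left-most and right-most columns.
    """
    # Assumption: a field with no motion at all is treated as still, not as a zoom.
    if not any(v_top + v_bottom + u_left + u_right):
        return False
    return opposite_motion(v_top, v_bottom) and opposite_motion(u_left, u_right)
```

During a pan the border vectors point in the same direction, so the difference is smaller than the larger magnitude and the test fails.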

3.2. Hough transform

Akutsu et al. [3] characterize the MV patterns corresponding to the different types of camera operations by two parameters: (1) the magnitude of MVs and (2) the divergence/convergence point, see Table 2.

The algorithm has three stages. During the first one, a block matching algorithm is applied to determine the MVs between successive frames. Then, the spatial and temporal characteristics of MVs are determined. MVs are mapped to a polar coordinate space by the Hough transform. A Hough transform of a line is a point. A group of lines with point of convergence/divergence $(x, y)$ is represented by a curve $\rho = x \sin\varphi + y \cos\varphi$ in the Hough space. The least-squares method is used to fit the transformed MVs to the curve represented by the above equation. There are specific curves that correspond to the different camera operations, e.g. zoom is characterized by a sinusoidal pattern, pan by a straight line. During the third stage these patterns are recognized and the respective camera operations are identified. The approach is effective but also noise sensitive and with high computational complexity.
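The least-squares step can be sketched as a two-parameter linear fit. The sketch assumes each MV has already been mapped to a point $(\varphi, \rho)$ in the Hough space, with angles in radians, and solves the 2×2 normal equations directly; it is an illustration, not the procedure of Akutsu et al.

```python
import math

def fit_convergence_point(samples):
    """Fit rho = x*sin(phi) + y*cos(phi) to (phi, rho) samples by least squares.

    Returns (x, y), or None if the system is degenerate (e.g. all lines parallel).
    """
    s_ss = sum(math.sin(p) ** 2 for p, _ in samples)
    s_cc = sum(math.cos(p) ** 2 for p, _ in samples)
    s_sc = sum(math.sin(p) * math.cos(p) for p, _ in samples)
    s_rs = sum(r * math.sin(p) for p, r in samples)
    s_rc = sum(r * math.cos(p) for p, r in samples)
    det = s_ss * s_cc - s_sc ** 2
    if abs(det) < 1e-12:
        return None
    # Cramer's rule for the 2x2 normal equations.
    x = (s_rs * s_cc - s_sc * s_rc) / det
    y = (s_ss * s_rc - s_sc * s_rs) / det
    return x, y
```

When all lines share the same angle, as in a pan, the system is degenerate and the function returns None, which is consistent with a convergence point at infinity in Table 2.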

3.3. Supervised learning by examples

An alternative approach for detecting camera operations is proposed by Patel and Sethi [29]. They apply induction of decision trees (DTs) [31] to distinguish among the MV patterns of the following six classes: stationary, object motion, pan, zoom, track and ambiguous. DTs are a simple, popular and highly developed technique for supervised learning. In each internal node a test of a single feature leads to the path down the tree towards a leaf containing a class label, see Fig. 20. To build a decision tree, a recursive splitting procedure is applied to the set of training examples so that the classification error is reduced. To classify an example that has not been seen during the learning phase, the system starts at the root of the tree and propagates the example down to a leaf.

After the extraction of the MVs from the MPEG stream, Patel and Sethi generate a 10-dimensional feature vector for each P frame. Its first component is the fraction of zero MVs and the remaining components are obtained by averaging the column projection of MV directions. In order to develop a decision tree classifier, the MV patterns of 1320 frames have been manually labeled. The results have shown high classification accuracy at a low computational price. We note that as only MVs of P frames are used, the classification resolution is low. In addition, there are problems with the calculation of the MV direction due to the discontinuity at 0°/360°.

Fig. 21. MV patterns corresponding to different classes.

The above limitations are addressed in [21] where a neural supervised algorithm is applied. Given a set of pre-classified feature vectors (training examples), Learning Vector Quantization (LVQ) [19] creates a few prototypes for each class, adjusts their positions by learning and then classifies the unseen examples by means of the nearest-neighbor principle. While LVQ can form arbitrary borders, DTs delineate the concept by a set of axis-parallel hyperplanes which constrains their accuracy in realistic domains. In comparison to the approach of Patel and Sethi, one more class (dissolve) is added (Fig. 21) and the MVs from both P and B frames are used to generate a 22-dimensional feature vector for each frame. The first component is calculated using the number of zero MVs in forward, backward and interpolated areas. Then, the forward MV pattern is sub-divided in 7 vertical strips for which the following 3 parameters are computed: the average of the MV direction, the standard deviation of the MV direction and the average of MV modulus. A technique that deals with the discontinuity of angles at 0°/360° is proposed for the calculation of the MV direction. Although high classification accuracy is reported, it was found that the most difficult case is to distinguish dissolve from object motion. In [20] MV patterns are classified by an integration between DTs and LVQ. More specifically, DTs are viewed as a feature selection mechanism and only those parameters that appear in the tree are considered as informative and used as inputs in LVQ. The result is faster learning at the cost of a slightly worse classification accuracy.
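The prototype-based classification can be illustrated with the basic LVQ1 update rule. This is a generic sketch: which LVQ variant, learning rate and number of prototypes are used in [21] is not stated above, and the two-dimensional toy data stand in for the 22-dimensional MV feature vectors.

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((p - xi) ** 2 for p, xi in zip(prototypes[i], x)),
    )

def lvq1_train(prototypes, labels, data, targets, lr=0.1, epochs=20):
    """Move the winning prototype towards (same class) or away from (other class) x."""
    protos = [list(p) for p in prototypes]
    for _ in range(epochs):
        for x, target in zip(data, targets):
            i = nearest(protos, x)
            sign = 1.0 if labels[i] == target else -1.0
            protos[i] = [p + sign * lr * (xi - p) for p, xi in zip(protos[i], x)]
    return protos

def classify(prototypes, labels, x):
    """Assign x the label of its nearest prototype."""
    return labels[nearest(prototypes, x)]

# Hypothetical toy data: two classes in a two-dimensional feature space.
DATA = [(0.0, 0.1), (0.1, 0.0), (1.0, 0.9), (0.9, 1.0)]
TARGETS = ["pan", "pan", "zoom", "zoom"]
LABELS = ["pan", "zoom"]
TRAINED = lvq1_train([(0.3, 0.3), (0.7, 0.7)], LABELS, DATA, TARGETS)
```

Since the decision borders are determined by distances to the prototypes, they are not restricted to axis-parallel splits, which is the contrast with DTs drawn above.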

3.4. Spatiotemporal analysis

Another way to detect camera operations is to examine the so-called spatiotemporal image sequence. The latter is constructed by arranging each frame close to the other and forming a parallelepiped where the first two dimensions are determined by the frame size and the third one is the time. Camera operations are recognized by texture analysis of the different faces.

In [2] video X-ray images are created from the spatiotemporal image sequence, as shown in Fig. 22. Sliced x-t and y-t images are first extracted from the spatiotemporal sequence and are then subject to edge detection. The process is repeated for all x and y values, and the slices are summed in the vertical and horizontal directions to produce gray-scale x-t and y-t video X-ray images. There are typical X-ray images corresponding to the following camera operations: still, pan, tilt and zoom. For example, when the camera is still, the video X-ray images show lines parallel to the time line for the background and unmoving objects. When the camera pans, the lines become slanted; in the case of zooming, they are spread. We should note that performing edge detection on all frames in the video sequence is time consuming and computationally expensive.

Fig. 22. Creating video X-ray image.

Fig. 23. Producing 2DST image using 25 horizontal and vertical segments.

In [23] the texture of 2-dimensional spatiotemporal (2DST) images is analyzed and the shots are divided into sub-shots described in terms of still scene, zoom and pan. The 2DST images are constructed by stacking up the corresponding segments of the images (Fig. 23). The directivity of the textures is calculated by computing the power spectrum by applying the 2-dimensional discrete Fourier transform.

4. Conclusions

Temporal video segmentation is the first step towards automatic annotation of digital video for browsing and retrieval. It is an active area of research gaining attention from several research communities including image processing, computer vision, pattern recognition and artificial intelligence.

In this paper we have classified and reviewed existing approaches for temporal video segmentation and camera operation recognition, discussing their relative advantages and disadvantages. More than eight years of video segmentation research have resulted in a great variety of approaches.

Early work focused on cut detection, while more recent techniques deal with gradual transition detection. The majority of algorithms process uncompressed video. They can be broadly classified into five categories, Fig. 24(a). Since video is likely to be stored in compressed format, several algorithms that operate directly on the compressed video stream have been reported. Based on the type of information used, they can be divided into six groups, Fig. 24(b). Their limitations, which highlight the directions for further development, can be summarized as follows. Most of the algorithms (1)


Fig. 24. Taxonomy of techniques for temporal video segmentation that process (a) uncompressed and (b) compressed video ( ) detect cut, ( ) detect gradual transitions.

Fig. 25. Taxonomy of techniques for camera operation recognition.

require reconstruction of DC terms of P or P&B frames, or sacrifice temporal resolution and classification accuracy; (2) process unrealistically short gradual transitions and are unable to recognize the different types of gradual transitions; (3) involve many adjustable thresholds; (4) do not handle false positives due to camera operations. None of them is able to distinguish gradual transitions from object movement sufficiently well. Some of the ways to achieve further improvement include the use of additional information (e.g. audio features and text captions), integration of different temporal video segmentation techniques and development of methods that can learn from experience how to adjust their parameters. Camera operation recognition is an important issue related to video segmentation. Fig. 25 presents the taxonomy of camera operation recognition techniques.

This research also confirms the need for benchmark video sequences and unified evaluation criteria that will allow consistent comparison and precise evaluation of the various techniques. The benchmark sequences should contain enough representative data for the possible types of camera operations and shot transitions, including complex gradual transitions (i.e. between sequences involving



