A Visual Attention based Region-of-Interest Determination Framework for Video Sequences ∗

Wen-Huang CHENG^†a), Wei-Ta CHU^††b), and Ja-Ling WU^†c), Nonmembers

SUMMARY This paper presents a framework for automatic video region-of-interest determination based on a visual attention model. We view this work as a preliminary step towards the solution of high-level semantic video analysis. Facing such a challenging issue, we make a set of attempts at using video attention features and knowledge of computational media aesthetics. The three types of visual attention features we use are intensity, color, and motion. Referring to aesthetic principles, these features are combined according to camera motion types on the basis of a newly proposed video analysis unit, the frame-segment.

We conduct subjective experiments on several kinds of video data and demonstrate the effectiveness of the proposed framework.

key words: region-of-interest, video analysis, visual attention model, computational media aesthetics.

1. Introduction

The rapid progress of multimedia production technology has contributed to the extensive use of multimedia and the explosive development of mobile communication, especially the ever-increasing importance of video communication, such as video phone and video-on-demand. The wide-ranging usage of video communications brings several visible trends: 1) More and more end users have devices with diverse capabilities, such as Pocket PCs and Smartphones. 2) As the types of networks, devices, and compression formats increase, interoperability among different systems and networks becomes more important. 3) There is too much redundant information in multimedia documents to be processed efficiently. In facing these challenges, one of the key technologies is region-of-interest (ROI) determination [1][2], which benefits applications such as content adaptation, transcoding, and intelligent information management. Moreover, it provides a practicable way for semantic-level analysis without the need to fully understand the document's content.

Manuscript received October 8, 2004.

Manuscript revised February 6, 2005.

Final manuscript received March 15, 2005.

†The author is with the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan, 10617, R.O.C.

††The author is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 10617, R.O.C.

a) E-mail: [email protected]

b) E-mail: [email protected]

c) E-mail: [email protected]

∗This work was partially supported by the CIET-

In general, an ROI is a portion of a multimedia document that audiences show more interest in or pay more attention to than others. For ease of explanation, we first give a precise definition of an ROI. An ROI is a portion of a frame that contains the key concept or main subject of a visual scene and provides end users a more concise and informative representation of a document; e.g., the speaker should be one of the ROIs in a conference scene.

In the literature, schemes proposed for determining ROIs can be divided into two categories: saliency-oriented and task-oriented. The saliency-oriented scheme predicts what will involuntarily attract our visual attention in a scene and identifies the interesting regions when the saliency information is given. According to psychological findings about the primate visual system and eye fixation, quite a few vision models for still images have been developed to simulate the cognitive mechanism of human beings. One well-known approach is based on Itti's visual attention model [3], in which several spatial visual features are combined into a single saliency map for representing local conspicuity in images. This model has been extensively studied in many fields and was shown to be robust in intelligent processing of digital images [4][5][6]. However, because it ignores temporal aspects, its extension to moving pictures needs to be explored.

Some approaches for analyzing video attention were then proposed. Ma et al. [7][8] presented user attention models for video skimming and summarization, which utilized more audio-visual features of semantics, for example, motion, speech, camera operation, and lexical information. In their work, although the video features are shown to be effective in detecting temporal attention, their interactions with spatial visual features are still unknown. Ho et al. [9] proposed a framework for video focus detection based on visual attention, which introduced a video-genre-based method for saliency map generation. That is, for different video categories, different parameter sets are elaborately optimized and assigned accordingly. The experiment shows impressive results, but the method is too highly domain-

IEICE TRANS. INF. & SYST., VOL.Exx–??, NO.xx XXXX 200x

Fig. 1 Block diagram of the proposed framework for conducting video ROI determination.

fined goal and what will voluntarily attract his attention when studying a scene. In this case, a salient location may be completely filtered out because it is irrelevant to the viewer's goal. Navalpakkam et al. [10] propose an architecture to estimate the task-relevance of attended locations in a scene. The location-based relevant information is represented with a task-relevance map for each image. Recent studies by Cater et al. [11], Golenzer et al. [12], and Lin et al. [13] also fall into this class. Generally, with an explicit description of the viewer's target, task-oriented schemes perform better than saliency-oriented ones. However, the goal or task may not always be available in advance.

In summary, the problems associated with conventional video ROI determination based on a visual attention model can roughly be divided into three categories. The first is the lack or unsuitable treatment of temporal and motion information. The second is that fixed or video-genre-based feature combination methods seem problematic for practical use. Finally, little effort has been put into integrating the advantages of both saliency-oriented and task-oriented schemes.

In this work, we consider the problem from the viewpoints of both the visual attention model and computational media aesthetics [14][15]. Our goal is to develop a framework that can be used to determine the video ROI using computable visual features and general video-shooting principles. In this way, the strengths of saliency-oriented and task-oriented schemes are both integrated in our work. In addition to light and color, object motion is adopted as one visual feature in our attention model. Rather than a single frame, we choose

Fig. 2 An example of operations of the proposed framework. The input frame-segment is under static-with-object-motion camera type (defined in Section 3.4.2).

veloped and applied to the framework. We conduct many experiments on several kinds of video data and demonstrate the effectiveness of the proposed framework in video ROI determination.

The rest of this paper is organized as follows. Section 2 presents the proposed framework for video ROI determination. Visual attention representation and camera motion utilization are described in Section 3. Section 4 discusses the dynamic ROI determination from a saliency map. Section 5 shows experimental results, and Section 6 presents our concluding remarks.

2. An Overview of the Proposed Framework

The block diagram of the proposed framework is illustrated in Fig. 1. The input video is first segmented by a reliable shot boundary detection algorithm [16], which can correctly detect abrupt shot changes and gradual transitions. Each shot is then partitioned into non-overlapping "frame-segments" (explained in Section 3). For each frame-segment, one camera motion type is registered. This camera motion information is later used to generate the saliency map. Meanwhile, the feature maps of each feature model are computed. Taking the camera motion type into account, the different kinds of feature maps are combined with type-specific weights into the integrated saliency map. The video ROIs are then dynamically estimated according to the active area of the saliency maps. An operational example of the proposed framework is illustrated in Fig. 2.

3. Visual Attention Representation

CHENG et al.: A VISUAL ATTENTION BASED REGION-OF-INTEREST DETERMINATION FRAMEWORK FOR VIDEO SEQUENCES 3

ship between camera motion and visual attention is described. Based on our observations, a novel method for saliency map generation is presented.

3.1 Visual Attention Model

Visual attention refers to the ability of a viewer to concentrate his attention on some visual objects or regions. Previous research showed that this physiological process can be modeled by the so-called visual attention model [3][8]. In our work, three types of video-oriented visual features (intensity, color, and motion) are adopted to model the visual attraction of videos using the same idea.

3.1.1 Contrast Based Intensity and Color Feature Model

One of the most important ingredients of a visual attention model is contrast [17]. In psychology, perceptual experiments have shown that some color pairs, such as red-green and blue-yellow, possess high spatial and chromatic opposition. The same characteristic exists in lighting or intensity pairs with a large difference. Based on these observations, we include three contrast-based feature models, namely intensity, red-green color contrast, and blue-yellow color contrast, in our visual attention representation module. The contrast maps are respectively defined as follows.

intensity, red, green, blue, and yellow component value functions, respectively.

3.1.2 Motion Feature Model

The motion of objects plays an essential role in a video. It allows the video-maker to direct the audience's attention across the two-dimensional space of a frame [18]. In the proposed framework, two feature models, x-motion and y-motion, are used to represent the motion information of a video frame. The x-motion and the y-motion refer to the horizontal and the vertical movements of a specific pixel within a frame, respectively.

If we consider a video as a frame sequence with spatial axes (x, y) and temporal axis t, the spatio-temporal

slice with axes (y, t). To find the motion activity in the scene, the two-dimensional (2-D) structure tensor (ST) [19][20] of the slices is evaluated. The 2-D ST is preferred over other motion descriptors because it also yields a coherence (or confidence) measure. The 2-D ST, J, is expressed as

J =

the partial derivatives of a horizontal slice along the spatial and temporal dimensions. Consequently, the local motion angle θ_x and its corresponding confidence measure (cm_x) can be computed as

θ_x = 1

The vertical slice is processed in the same way to obtain the corresponding θ_y and cm_y. Finally, the x-motion and the y-motion maps are individually calculated as:

M_X(p) = θ_x × cm_x, (7)

M_Y(p) = θ_y × cm_y, (8)

where p = [x, y]^T is, again, a position vector.
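The computation behind Eqs. (7) and (8) can be sketched for a single local window of a slice. The expressions for the tensor entries, the angle, and the confidence measure are not legible in this copy, so the sketch below assumes the commonly used 2-D structure tensor orientation and coherence formulas; they may differ in detail from the authors' definitions. The function name `motion_response` and its inputs are hypothetical.

```python
import math

def motion_response(hx, ht):
    """Return (theta, cm, theta * cm) for one window of a spatio-temporal slice.

    hx, ht: partial derivatives of the slice along the spatial and temporal
    dimensions, sampled over a local window (hypothetical inputs).
    """
    # Structure tensor entries accumulated over the window (assumed form).
    jxx = sum(a * a for a in hx)
    jtt = sum(b * b for b in ht)
    jxt = sum(a * b for a, b in zip(hx, ht))

    # Local orientation in radians; atan2 avoids a division by zero
    # when jxx equals jtt.
    theta = 0.5 * math.atan2(2.0 * jxt, jxx - jtt)

    # Coherence in [0, 1]: 0 for an isotropic window, 1 for one orientation.
    trace = jxx + jtt
    if trace == 0:
        cm = 0.0
    else:
        cm = ((jxx - jtt) ** 2 + 4.0 * jxt ** 2) / trace ** 2

    # Motion feature value as in Eqs. (7) and (8): angle times confidence.
    return theta, cm, theta * cm
```

Weighting the angle by its confidence means that windows without a dominant orientation, such as flat or noisy areas, contribute little to the motion maps.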

3.2 Frame-segment

In previous research, visual attention is mostly modeled and determined for a single frame or, at most, for two consecutive frames. The collection of regions determined in each independent frame composes the final ROIs of a video sequence. However, based on our previous observations [9], we found that the single- or two-frame based approach generates acceptable results only for images, not for video ROI analysis. For example, the focus point may tremble swiftly due to a slight difference between two consecutive frames. This unpleasant phenomenon does not exist in viewers' attention. If the estimated ROIs are applied to extended applications, such as scalable coding and content adaptation, this defect will cause significant degradation in both bit rate and quality. Because the content of a video does not change drastically within a short duration, we take a short video clip, called a frame-segment, as the unit for conducting the video ROI analysis. The newly defined frame-segment takes both spatial and temporal correlations into account and can suppress noise caused by sudden luminance changes, such as flashlights. In our experiments, the length of a frame-segment is empirically set to 0.5


Fig. 3 Example of feature maps. (a) original video frame, (b) intensity, (c) red-green color, (d) blue-yellow color, (e) x-motion, and (f) y-motion feature maps.

Fig. 4 The procedure of generating the filtered intensity feature map from a frame-segment.

3.3 Feature Map and Filtered Feature Map

For each video frame of a frame-segment, the distribution of each feature is calculated and constructed as a feature map, giving five feature maps per frame, as shown in Fig. 3. Therefore, each frame-segment has five map sets, and each set is composed of 15 feature maps belonging to a specific feature. A temporal median filter is then applied to each set to obtain the corresponding filtered feature map. Fig. 4 shows an example of generating the filtered intensity feature map. Note that the temporal median filter plays an important role in the process. Filtering across the frames within a segment effectively suppresses noise and sub-salient regions, so that each filtered feature map represents the general characteristics of a specific feature in the frame-segment. In other words, both the spatial

Fig. 5 Demonstrations of visual attention under left-pan camera motion. The t-th to (t + 2∆t)-th video frames are captured at an interval of 0.5 seconds (i.e., ∆t = 0.5 seconds) from a TV sports program. The white squares indicate the possible attentive regions in the first frame.
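The temporal median filtering of Section 3.3 can be illustrated with a minimal sketch. It assumes each feature map is a list of rows of numbers and that all maps in a set share the same size; the function name `filtered_feature_map` is hypothetical.

```python
from statistics import median

def filtered_feature_map(maps):
    """Pixel-wise temporal median over one set of equally sized feature maps.

    maps: list of 2-D maps (lists of rows), one per frame of a frame-segment.
    """
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [median(m[r][c] for m in maps) for c in range(cols)]
        for r in range(rows)
    ]
```

For three 1x2 maps `[[0, 5]]`, `[[9, 5]]`, and `[[1, 6]]`, the result is `[[1, 5]]`: the isolated value 9, which could stand for a flashlight in one frame, does not survive the filter.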

3.4 Camera Motion Based Saliency Map Generation

3.4.1 Relations between Camera Motion and Visual Attention

Nowadays, a large number of videos are produced according to the principles of computational media aesthetics, especially expert-produced videos [14]. From the viewpoint of video shooting, different camera motions have different impacts on the audience's reception. In other words, they influence the relative importance of each visual feature and reveal what and where the video-maker wants viewers to see. The idea has been extensively used in film and TV show productions. On the other hand, from the perspective of task-oriented gaze control, the phenomenon that directors purposely move their camera to control the audience's fixations appropriately serves as a high-level hint for integrating spatiotemporal visual features.

Fig. 5 gives a real example. If you take the first (or t-th) video frame as a still image, your eyes will freely scan the entire image and be attracted by some noticeable regions, such as the scoreboard, screen texts, or players. However, if you take it as one of the video frames, that is, look at these frames rapidly in succession, you will find that your eyesight involuntarily moves left with the panning camera and is mostly attracted by horizontally moving objects. Your vision unconsciously follows the camera's track within the scene, and the relative saliency of each region changes accordingly. Therefore, it is our belief that camera movement should be considered in the process of ROI determination.

3.4.2 Camera Motion Registration


Table 1 The ranges of the non-uniform bins used to quantize tensor histogram.

Table 2 Weights for filtered feature maps under different camera motion types.

              Zoom   L/R-Pan   U/D-Tilt   Static   Motion
Intensity     0.2    0.05      0.05       0.15     0.05
Red-green     0.2    0.05      0.05       0.075    0.05
Blue-yellow   0.2    0.05      0.05       0.075    0.05
X-motion      0.2    0.75      0.1        0.35     0.425
Y-motion      0.2    0.1       0.75       0.35     0.425

zoom, left-pan, right-pan, up-tilt, down-tilt, static-with-no-motion, and static-with-object-motion. The spatio-temporal slice based motion analysis techniques [19] are used to register the camera motion type of each frame-segment. We use two tensor histograms, one for all horizontal slices and the other for all vertical slices, denoted as M_H and M_V, respectively. Within a frame-segment, all the local motions generated from the motion feature models are non-uniformly quantized into five bins Φ_i, i = -2, ..., 2 (cf. Table 1).

After constructing the two tensor histograms, a rule-based algorithm is applied to detect camera motion. We take two examples, the zoom and left-pan operations, to explain the detailed processes. For zoom, the tensor votes of the positive-motion-angle bins and the negative-motion-angle bins are approximately the same in both the horizontal and vertical slice tensor histograms. That is,

For the left-pan operation, the camera is moving fast toward the left, so the detected right-direction motion would be much greater than the left-direction one. The value of right-direction motion, M_H(Φ_i), Φ_i > 0, should also be greater than a given camera motion threshold to ensure that the motion is induced by the camera itself. That is,

where κ is the camera motion threshold. The other camera motion types can be decided in a similar way.
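The two rules described above can be sketched as follows. The exact inequalities are not legible in this copy, so the tolerance `balance_tol` and the ratio `dominance` are hypothetical parameters standing in for "approximately the same" and "much greater"; only the threshold `kappa` corresponds to a symbol in the text. Histograms are assumed to be dictionaries mapping the bin index (-2 to 2) to its vote count, with positive indices meaning right-direction (or positive-angle) motion.

```python
def classify_camera_motion(m_h, m_v, kappa, balance_tol=0.1, dominance=3.0):
    """Rule-based sketch covering only the zoom and left-pan rules.

    m_h, m_v: tensor histograms of the horizontal and vertical slices,
    given as {bin_index: votes} with bin_index in -2..2.
    """
    def split(hist):
        # Total votes in the positive-angle and negative-angle bins.
        pos = sum(v for i, v in hist.items() if i > 0)
        neg = sum(v for i, v in hist.items() if i < 0)
        return pos, neg

    def balanced(pos, neg):
        # "Approximately the same": relative difference within a tolerance.
        total = pos + neg
        return total > 0 and abs(pos - neg) <= balance_tol * total

    h_pos, h_neg = split(m_h)
    v_pos, v_neg = split(m_v)

    # Zoom: positive and negative votes balance in both histograms.
    if balanced(h_pos, h_neg) and balanced(v_pos, v_neg):
        return "zoom"

    # Left-pan: right-direction votes dominate and exceed the threshold.
    if h_pos > kappa and h_pos > dominance * h_neg:
        return "left-pan"

    return "other"
```

The remaining camera motion types would need analogous rules, which are not spelled out in the text and are therefore left as `"other"` here.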

Fig. 6 Examples of determined ROIs (dotted-line squares) for two different settings of feature weights. The manually marked ground truths are indicated by solid-line squares.

generic saliency map are decided according to the registered camera motion types (described in the next subsection). The generic saliency map is generated according to the following equation:

S(N) = α_{c,1} × FFM_1 + ··· + α_{c,n} × FFM_n, (11)

where S(N) is the generated generic saliency map of a frame-segment of length N, FFM_i is the i-th filtered feature map of that segment, and α_{c,i} is the weight of the corresponding FFM_i under the given camera motion type c. Table 2 shows the weights for the various camera motion types and filtered feature maps used in our framework. These weights are chosen to reflect the characteristics of the different camera motion types. For example, when camera panning occurs, the horizontal motion is emphasized.
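As an illustration of Eq. (11), the sketch below combines five filtered feature maps with the weights of Table 2. The dictionary keys and the function name `saliency_map` are hypothetical labels, and the maps are assumed to be equally sized lists of rows ordered as intensity, red-green, blue-yellow, x-motion, y-motion.

```python
# Weights from Table 2, ordered as
# (intensity, red-green, blue-yellow, x-motion, y-motion).
WEIGHTS = {
    "zoom":   (0.2, 0.2, 0.2, 0.2, 0.2),
    "pan":    (0.05, 0.05, 0.05, 0.75, 0.1),      # L/R-Pan column
    "tilt":   (0.05, 0.05, 0.05, 0.1, 0.75),      # U/D-Tilt column
    "static": (0.15, 0.075, 0.075, 0.35, 0.35),   # Static column
    "motion": (0.05, 0.05, 0.05, 0.425, 0.425),   # Motion column
}

def saliency_map(ffms, camera_type):
    """Weighted sum of filtered feature maps for one frame-segment."""
    weights = WEIGHTS[camera_type]
    rows, cols = len(ffms[0]), len(ffms[0][0])
    return [
        [sum(w * f[r][c] for w, f in zip(weights, ffms)) for c in range(cols)]
        for r in range(rows)
    ]
```

Each column of Table 2 sums to 1, so the saliency values stay in the same range as the input maps. Under a pan, a pixel that is strong only in x-motion receives 0.75 of that strength, whereas the same pixel under a tilt receives only 0.1.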

3.4.4 Procedure for Feature Weights Selection

As shown in Fig. 6, the selection of appropriate feature weights is important in ROI determination. However, due to the large number of candidate weights and their combinations, it is impractical to decide an appropriate combination of weights manually. On the other hand, it is also impractical to make the selection through exhaustive search alone, because the weight selection depends highly on subjective human perception. In this work, we first use a generic procedure to sieve out some candidates from the weight combinations based on a certain selection rule. The final decision is then made by the end user. Since the procedure is applicable to all the adopted camera types, without loss of generality, we only describe the analysis of the left-pan operation in the following.

First, a set of frame-segments F = {F_i, i = 1, ..., T} (e.g., T = 50) is carefully chosen from various kinds of videos. Without loss of generality, each F_i is assumed to contain one definite main subject, which has been manually marked as the ROI. These frame-segments with marked ROIs form the ground truth of our training benchmark. Let w_j, j = 1, ..., 5, be


Fig. 7 Examples of frame-segments with (a) one and (d) two ROIs (indicated by the white squares); (b) and (e) are the corresponding saliency maps, and (c) and (f) are the 3-D profiles of the saliency maps of (a) and (d), respectively.

Note that Σ_{j=1}^{5} w_j = 1. Then, all possible combinations of the weight vector (w_1, w_2, ..., w_5) are generated according to a predefined deviation step size, say 0.005. As shown in Fig. 6, if a weight vector correctly reflects the relative importance of each feature, the region size and location of a determined ROI will closely match those of the ground truth. That is, the overlapped area of the dotted-line and solid-line squares will nearly equal their joint region. Based on the
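The candidate generation and the matching criterion described above can be sketched as follows. This is only an illustration under stated assumptions: weights are taken as non-negative multiples of the step size, ROIs are axis-aligned boxes given as (x0, y0, x1, y1), and the "overlapped area" and "joint region" are read as intersection and union. The names `weight_vectors` and `overlap_ratio` are hypothetical.

```python
def weight_vectors(n, steps):
    """Yield all n-tuples of non-negative multiples of 1/steps that sum to 1."""
    def rec(remaining, slots):
        # Distribute `remaining` step units over `slots` weights.
        if slots == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for tail in rec(remaining - k, slots - 1):
                yield (k,) + tail

    for combo in rec(steps, n):
        yield tuple(k / steps for k in combo)

def overlap_ratio(a, b):
    """Intersection area over union area of two boxes (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0
```

A step size of 0.005 corresponds to `steps=200`; with five weights and zero weights allowed, this gives about 70 million candidate vectors, which is why an automatic sieving rule is applied before the end user makes the final choice. A ratio close to 1 indicates that the determined ROI closely matches the marked ground truth.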
