一個關於切割影片以產生單一/多重場景背景之研究

(1)

國

立

交

通

大

學

資訊科學與工程研究所

博

士

論

文

一個關於切割影片以產生單一/多重場景背景之研究

A Study on Single/Multiple Sprite Generation and Partition for Videos

研究生：郭萓聖

指導教授：陳玲慧教授

(2)

一個關於切割影片以產生單一/多重場景背景之研究

A Study on Single/Multiple Sprite Generation and Partition for Videos

研究生：郭萓聖 Student：I-Sheng Kuo

指導教授：陳玲慧 Advisor：Ling-Hwei Chen

國立交通大學

資訊科學與工程研究所

博士論文

A Dissertation

Submitted to Institute of Computer Science and Engineering College of Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in

Computer Science

July 2008

Hsinchu, Taiwan, Republic of China

(3)

i

一個關於切割影片以產生單一

/ 多重場景背景之研究

學生：郭萓聖

指導教授：陳玲慧教授

國立交通大學資訊科學與工程研究所博士班

摘

要

以物件為基礎編碼的

MPEG-4 採用了一種創新的場景背景編碼方式，該

方法可以提高背景部分的編碼效率。

MPEG-4 所提出的場景背景產生系統，

使用算數平均將所有影像疊合以產生場景背景影像，然而這樣的方式會使

得產生出的場景背景影像中某些區域變得模糊，特別是曾經被移動物件佔

據過的位置。為了防止產生背景模糊的狀況，

MPEG-4 建議使用者提供一個

物件分割遮罩，標示畫面中屬於移動物件的位置，以避免物件被混入場景

背景之中。我們依照

MPEG-4 所提出的架構，建立一個場景背景產生系統。

但手動產生所有畫面的物件分割遮罩是不切實際的，因此在所建立的系統

中，我們提出一個自動化物件分割遮罩產生方法。該方法先以不使用物件

遮罩的方式產生粗糙場景背景，而後以粗糙背景為參考影像產生物件分割

遮罩，最後再以所產生的遮罩重新產生較佳的場景背景。實驗結果顯示所

提的系統產生之場景背景，具有良好的視覺品質。

自動化影像分割方法所產生的物件分割遮罩，不可能非常完美的將所有

移動物件與背景區分。未正確區分的物件分割遮罩，會使得部分移動物件

被混合入背景影像之中。導致所產生的場景背景影像中，出現如鬼影般的

移動物件殘骸。為了解決這個問題，我們將提出一個不需要物件分割遮罩

的場景背景產生系統。所提出的系統包含兩個新方法：均勻化特徵點擷取

方法與智慧型影像疊合方法。提出的特徵點擷取方法估計背景的運動向

量，利用該向量將特徵點中屬於移動物件的點予以排除。同時以均勻化的

擷取方式平均分散所有特徵點的位置。提出的智慧型影像疊合方法使用一

種計數方法，使得只有屬於背景的點被混合入場景背景影像之中。實驗結

果顯示所提出的均勻化特徵點擷取方法能有效的提高全域運動估計的準確

(4)

ii

性，因而提高場景背景影像之品質。提出的智慧型影像疊合方法則能夠將

物件排除在疊合過程以外，使得以提出之方法產生的場景背景影像，不存

在分割失誤可能導致的鬼影現象。其視覺品質接近使用人工產生之物件分

割遮罩產生的場景背景影像，並優於

Smolic et al.提出之使用自動化物件分

割之場景背景產生方法。

場景背景產生系統中，應用了幾何轉換將非參考畫面轉換至參考畫面的

座標系統。進行幾何轉換會使轉換後畫面，以及根據轉換後畫面疊合的場

景背景影像變的歪曲。這使得場景背景影像所需要的儲存空間增加，同時

亦限制了場景背景影像所能夠涵蓋的視角。對此

Farin et al.提出了使用多重

場景背景的方式解決問題。使用多張場景背景影像所需的儲存空間總和，

有可能較使用單一場景背景影像來的小，同時亦能涵蓋較大範圍的視角。

然而

Farin et al.所提出的方法，利用暴力搜尋法找出最佳的影片分割位置。

若有

N 個畫面，這樣的方法需要 O(N

3

)的執行時間與 O(N

2

)的儲存空間。為

了降低運算的複雜度，我們提出一個快速的多重場景背景影片分割方法。

該方法包含一個可能的分割位置選取方法以及一個快速參考畫面選擇方

法。利用測量畫面之間的移動與縮放，以找出影片中有可能的分割位置。

並由這些可能的分割位置尋得最終的分割位置，將影片分割為數個子影

片，最後每一個子影片將產生一個場景背景影像。若所提出的方法找到

M

個可能的分割位置，則所提出的方法僅需要

O(M

2

N)的執行時間以及

O(M

2

)+O(N)的儲存空間。同時所產生的數個場景背景影像的總儲存空間僅

較暴力搜尋法所產生的略高。

(5)

iii

A Study on Single/Multiple Sprite Generation and Partition for Videos

Student：I-Sheng Kuo

Advisors：Dr. Ling-Hwei Chen

Institute of Computer Science and Engineering

National Chiao Tung University

ABSTRACT

Sprite coding, which can increase the coding efficiency of backgrounds

greatly, is a novel technology adopted in MPEG-4 object-based coding. The

sprite generator introduced in MPEG-4 blends frames by averaging blending,

this will make some places, which are ever occupied by moving objects, look

blurring. Thus, providing segmented masks for moving objects is suggested. We

build a sprite generation system based on MPEG-4’s framework, but we find

that using manual segmentation masks in a sprite generation system is

impractical. An automatic segmentation mask generation method is proposed

and is applied in the sprite generation system. The sprite generation system

produces a coarse sprite first by MPEG-4’s method without segmentation masks.

Then the coarse sprite is employed as the reference image in the proposed

segmentation mask generation method. After generating the segmentation masks,

a better sprite is re-generated again with generated segmentation masks.

Experimental results show the sprite generated by the proposed system has good

quality.

Automatic image segmentation can not produce perfect object segmentation

masks. Segmentation faults in segmentation masks causes some moving objects

being blended into a sprite. This makes some ghost-like shadows appear in a

generated sprite. To treat this problem, a sprite generation without segmentation

masks is proposed in this dissertation. The proposed sprite generator consists of

two novel methods: a balanced feature point extraction method and an

(6)

iv

intelligent blending method. The feature point extraction method estimates the

motion vector of background pixels, and excludes pixels of moving objects from

the feature points. Proposed intelligent blending method blends only background

pixels into a sprite by a simple counting schema. Experimental results show the

feature points extracted by the proposed method increases the accuracy of global

motion estimation, and the quality of generated sprites is increased. The

proposed intelligent blending method excludes pixels of moving objects directly

in the blending procedure. Thus ghost-like shadows caused by segmentation

faults is not exist in the sprite generated by our method. The visual quality of our

sprite is close to that using manually segmented masks and is better than that

generated by Smolic et al.’s method.

Due to the geometric transformation applied to each non-reference frame in

the procedure of sprite coding, the generated sprite is distorted and the available

view angles relative to the reference frame are restricted. This makes multiple

sprites used be necessary. An optimal multiple sprite generation method has

been proposed by Farin et al., but it uses an exhaustive search to find the

optimal partition and reference frames. Let N be the number of frames, Frains’

method requires O(N

3

) time and O(N

2

) space to perform the search. In order to

reduce the complexity, a fast multiple sprite partition method is proposed in this

dissertation. The proposed method includes a fast partition point finding method

and a fast reference frame finding method. The proposed partition point finding

method measures translation and scaling between frames and finds candidate

partition points by the measured values. The final partition positions are decided

from these candidate points, and reference frames of each partition are found by

the proposed fast reference frame selecting method. Let M candidate partition

points are found, the proposed method requires only O(M

2

N) in time and

O(M

2

)+O(N) in space. The total size of generated sprites is only slightly higher

(7)

v

誌

謝

這篇論文能夠完成，首先要感謝指導教授陳玲慧博士的細心指導與協助。於陳教授身上，我不僅學習到豐富的學業知識與嚴謹的研究態度，同時亦對待人接物的道理以及為人處事應有的堅持有更深的瞭解。誠摯的感謝陳教授多年來的指導。感謝口試委員莊仁輝教授、荊宇泰教授、陳永昇教授、張隆紋教授、鍾國亮教授、尹邦嚴教授及李建興教授對論文與研究提供許多寶貴的指導與建議，使得本論文更加完善。接著要謝謝自動化資訊處理實驗室一同研究的伙伴們。建興學長、遠坤學長、坤龍學長、昭玲學姊以及瑞祥學長，學長姊們在研究過程中給予的指導與叮嚀，提供很大的幫助。感謝博士班同學民全、文超、惠龍及俊旻，大自學術上的研討，小至生活上的閒談，都讓我在交大的生活更加充實。瓊雪、佩瑩、立人、芳如、信嘉、薰瑩、子翔、偉全等幾位學弟妹，有你們在，實驗室總是充滿歡樂與愉悅的氣氛。同時也要感謝系上助理耀萱，因為她的大力幫忙，我才能夠如期完成畢業口試。此外我也要感謝從小到大求學期間認識的好朋友們：一同通車上下學，經常能夠仗義執言的君燦；經常聊天聊到深夜，互相通報八卦，一起想鬼點子，無話不談的好朋友怡慧；細心體貼、總是為人著想的肚魚以及點子很多、經常帶來歡笑的世懷。特別謝謝草莓公主，這幾年妳的關心與體諒，讓我能夠沒有壓力的完成學位。謝謝三年來陪伴在身邊並慷慨提供自拍照於論文中的百香果。還有中山大學滋青社的社團伙伴們，讓我在大學時期經歷許多美好的體驗與成長；經常給予我人生中許多中肯建議的韶玲老師，以及諮商中心志工團的朋友們。最後我要感謝一路栽培我的父母、舅舅以及兄長於生活上的諸多照料，讓我無需擔心經濟壓力，專心攻讀博士學位。謹以最誠摯的心意將本篇論文獻給我的父母，也獻給曾經在我的人生中給予我鼓勵及協助的每一個人。

(8)

vi

LIST OF TABLES

Table 4.1 Experimental results of sequence ‘stefan’ (perspective)...78

Table 4.2 Experimental results of sequence ‘stefan’ (affine)...78

Table 4.3 Experimental results of sequence ‘tabletennis’. ...78

Table 4.4 Experimental results of sequence ‘building’. ...82

(11)

ix

LIST OF FIGURES

Fig. 1.1 The framework of the sprite generator in MPEG-4 VM...2

Fig. 1.2 The two-stage GME schema. ...7

Fig. 1.3 Image of Hessian value of a frame...8

Fig. 1.4 Reliability mask used in the reliability-based blending. (a) Segmentation mask. (b) Reliability mask derived from (a)...13

Fig. 1.5 Sprite coordinate system and geometric distortions of transformed frames. ...16

Fig. 1.6 Demonstration of geometric distorted frame due to camera rotation. (a) Original frame A. (b) Original frame B. (c) Transformed frame B. ...17

Fig. 1.7 Sprites with different reference frames. (a) Rightmost frame as reference frame. (b) Middle frame as reference frame...18

Fig. 1.8 Geometric distortions using two sprites...19

Fig. 1.9 Effect of camera zoom-in with the details of the reconstructed frame lost...27

Fig. 1.10 Effect of camera zoom-out with the sprite blurred. ...28

Fig. 2.1 Object blurring effect of averaging blending. (a) Original frame. (b) Reconstructed background of (a)...30

Fig. 2.2 The proposed two-pass sprite generator...31

Fig. 2.3 (Continued) The generation of a segmentation mask. (a) The original image. (b) The reconstructed background. (c) The object pixels extracted by using two-stage thresholding. (d) The seed image. (e) The base image. (f) The generated segmentation mask...34

Fig. 2.3 The generation of a segmentation mask. (a) The original image. (b) The reconstructed background. (c) The object pixels extracted by using two-stage thresholding. (d) The seed image. (e) The base image. (f) The generated segmentation mask. ...35

Fig. 2.4 (Continued) Generated sprites of different methods. (a) The first pass. (b) Two pass generation with automatic generated segmentation masks. (c) MPEG-4 with manually segmented masks. ...37

Fig. 2.4 Generated sprites of different methods. (a) The first pass. (b) Two pass generation with automatic generated segmentation masks. (c) MPEG-4 with manually segmented masks. ...38

Fig. 2.5 Reconstructed frames of different methods. (a) The first pass. (b) Two pass generation with automatic generated segmentation masks. (c) MPEG-4 with manually segmented masks. ...38 Fig. 3.1 Feature point extraction based on Hessian value. (a) Original image. (b)

(12)

x

methods. (d) Feature points extracted by the proposed method. ...43 Fig. 3.2 An example for outlier removing. (a) Detected object pixels in Fig. 3.1(d).

(b) The feature points after removing outliers from Fig. 3.1(d). ...46 Fig. 3.3 Two examples to show sprites generated using different feature points with the

same number. (a) The sprite generated using feature points extracted by

conventional methods. (b) The sprite generated using feature points extracted by the proposed method. (c) A close look of the white line in (a). (d) A close look

of the white line in (b). ...47 Fig. 3.4 Segmentation errors in player’s feet. (a) Frame 255 with left foot

incompletely segmented. (b) Frame 258 with both feet incompletely segmented. ...49 Fig. 3.5 Two examples to show the blended sprites using different methods. (a) The

first example of the generated sprite based on the reliability-based blending. (b) The second example of the generated sprite based on the reliability-based blending. (c) The first example of the generated sprite based on the intelligent blending. (d) The second example of the generated sprite based on the intelligent blending...49 Fig. 3.6 Flowchart of the intelligent blending. ...51 Fig. 3.7 Contrast-enhanced version of Fig. 3.5. (a) Contrast-enhanced version of Fig.

3.5(a). (b) Contrast-enhanced version of Fig. 3.5(b). (c) Contrast-enhanced

version of Fig. 3.5(c). (d) Contrast-enhanced version of Fig. 3.5(d)...53 Fig. 3.8 (Continued) The generated sprites. (a) Sprite generated using averaging

blending without segmentation masks. (b) Sprite generated using averaging blending based on manually segmented masks. (c) Sprite generated using

reliability-based blending. (d) Sprite generated using intelligent blending...54 Fig. 3.8 The generated sprites. (a) Sprite generated using averaging blending without

segmentation masks. (b) Sprite generated using averaging blending based on manually segmented masks. (c) Sprite generated using reliability-based blending. (d) Sprite generated using intelligent blending. ...55 Fig. 3.9 The reconstructed backgrounds. (a) Averaging without segmentation masks.

(b) Averaging with manually segmented masks. (c) Reliability-based blending. (d) Intelligent blending...56 Fig. 3.10 PSNR comparison of different blending strategies. ...58 Fig. 3.11 The generated sprite of Lu et al.’s work [21]. ...59 Fig. 3.12 (Continued) Close views of the generated sprites. (a) Part 1 of Lu et al.’s

work [21]. (b) Part 1 of the proposed method. (c) Part 2 of Lu et al.’s work [21]. (d) Part 2 of the proposed method. (e) Part 3 of Lu et al.’s work [21]. (f) Part 3

of the proposed method. ...59 Fig. 3.12 Close views of the generated sprites. (a) Part 1 of Lu et al.’s work [21]. (b)

Part 1 of the proposed method. (c) Part 2 of Lu et al.’s work [21]. (d) Part 2 of the proposed method. (e) Part 3 of Lu et al.’s work [21]. (f) Part 3 of the

(13)

xi

proposed method. ...60

Fig. 4.1 Background displacements of ‘stefan’. (a) Global translations. (b) Accumulated translations. ...64

Fig. 4.2 Finding feasible partition points...67

Fig. 4.3 Accumulated translations of ‘tabletennis’...69

Fig. 4.4 Frame area calculation of a geometric-transformed frame. ...70

Fig. 4.5 Accumulated scaling factors of ‘tabletennis’...71

Fig. 4.6 Generated sprites of sequence ‘stefan’ by different methods. (a) Farin et al.’s optimal method. (b) Proposed method with normal validation. (c) Proposed method with fast validation. ...80

Fig. 4.7 Generated sprites of sequence ‘tabletennis’ by different methods. (a) Farin et al.’s optimal method. (b) Proposed method with normal validation. (c) Proposed method with fast validation. ...81

Fig. 4.8 Generated sprites of sequence ‘building’ by different methods. (a) Farin et al.’s optimal method. (b) Proposed method with fast validation...82

(14)

xii

ABBREVIATION

FPX Feasible Partition Point based on X-axis Translation

FPY Feasible Partition Point based on Y-axis Translation

FPS Feasible Partition Point based on Scaling

GME Global Motion Estimation

GMP Global Motion Parameter

MPEG Moving Picture Expert Group

MPEG-4 VM MPEG-4 Verification Model

MSE Mean-Squared Error

(15)

1

CHAPTER 1 INTRODUCTION

1.1 Motivation

MPEG-4 [1] had adopted a novel technique to code a series of backgrounds belonging to

a scene into a single panoramic image, which is often denoted as a ‘sprite’ or a ‘background

mosaic’ [2-6]. The constructed sprite and some specified parameters are transmitted to the

receiver, and then a decoder can reconstruct the series of backgrounds by the transmitted

information. Since the sprite is transmitted only once, this technique can achieve very low

bit-rate with good quality. Aside from the high coding efficiency of sprite coding, the

generated sprite is also useful for segmenting moving objects [7-11]. The extracted moving

objects and the sprite itself can be used in video summarization [12].

A sprite is constructed in the encoder by a sequence of complex algorithms called a

‘sprite generator’. MPEG-4 VM [13-14] has provided a framework of sprite generator as

shown in Fig. 1.1. The framework contains three parts: global motion estimation (GME),

frame warping and frame blending. The GME aims at finding the spatial location variation,

which is caused by the camera motion, of the background in the current frame relative to the

current sprite. The camera motion can be represented by parameters of some geometric

models often denoted as global motion parameters (GMP). Gradient descent based algorithms

(16)

2

avoid the solution being local optimum, and an error function is required to evaluate the

performance of different GMPs. The squared error between the current frame and the current

sprite is usually employed as the error function. In order to raise speed, only some points are

selected as feature points and involved in the computation of the error function [18-21]. The

selection of feature points decides the estimation accuracy, especially when the number of

feature points is small. Thus, how to select few representative feature points is an important

issue. Most existing methods [18-21] take those points with higher variations in spatial or

temporary domain; this will lose some important ones with moderate variations and will

include some moving object points with higher variations. To avoid this disadvantage, in this

dissertation, a balanced feature point extractor is provided to increase the estimation precision

for GMPs.

Fig. 1.1 The framework of the sprite generator in MPEG-4 VM.

After obtaining GMPs, the current frame is first geometrically transformed (also called

(17)

3

is then blended into the current sprite by a blending strategy. In MPEG-4’s framework,

averaging blending is employed as the blending function. This will make some places, which

are ever occupied by moving objects, blurred. To avoid the disadvantage, providing

manually-segmented masks have been suggested. The segmentation masks are used to

distinguish moving objects from background such that moving objects will not be involved in

blending and the quality of the generated sprite can be significantly improved. However,

segmenting moving objects manually from the video frames is impractical. In this dissertation,

two approaches will be provided to increase the blending quality. The first approach generates

these segmentation masks from a coarsely generated sprite without these masks. Then a better

sprite is generated using the automatically generated masks. The second approach tries to get

rid of the segmentation masks by developing a new blending method that excludes moving

objects from being blended into the sprite.

The video sequence is inputted frame by frame into the sprite generator. The sprite buffer

holds the sprite generated so far and provides the current sprite to the GME process as the

reference image.

1.2 Basic of Sprite Generation

As Fig. 1.1 shows, a sprite generator can be divided into three parts. The global motion

(18)

4

frames. The warping transforms a frame into another frame’s coordinate according to the

camera motion estimated in the global motion estimation. Finally, the blending process blends

the transformed frames and merges them into a single sprite. These parts will be introduced as

followings.

1.2.1 Global motion estimation

The aim of global motion estimation is to obtain an accurate estimation of background

motion between the current frame and a reference image, which is often the previous frame or

the current sprite. Many global motion estimation methods have been proposed [22-26].

Before estimating the background motion, a motion model must be chosen to describe the

motion of background [27]. Affine transformation and perspective transformation are widely

used as the motion model in the generation of a sprite. Then an image registration method [28]

is applied to find corresponding pixel pairs that belong to the same location of background in

both images, and the global motion parameters are estimated by iterative minimization

methods.

1.2.1.1 Motion model selection

The motion of background comes from camera motion, like zooming, panning, and

(19)

5

transformations is often chosen as the camera motion model: the affine transformation and the

perspective transformation. Both of them are defined by two equations with a set of

parameters and as follows:

Affine Transformation: ⎩ ⎨ ⎧ + + = + + = 6 5 4 3 2 1 ' ' m y m x m y m y m x m x (1.1) Perspective Transformation: ⎪ ⎪ ⎩ ⎪⎪ ⎨ ⎧ + + + + = + + + + = 1 ' 1 ' 8 7 6 5 4 8 7 3 2 1 y m x m m y m x m y y m x m m y m x m x (1.2)

where (x,y) and (x’,y’) denote the coordinates of a pixel before and after the camera motion

respectively. m1,m2,…,m8 are the transformation parameters also referred as global motion

parameters (GMPs). We can see that the affine transformation with six parameters is a special

case of the perspective transformation with m7=m8=1. The perspective transformation can

describe more complicated camera motions. However, the higher computational complexity

of the perspective transformation limits its usability. The affine transformation is quite simple,

but it has a lower accuracy. Both of them are adopted as standard tools in MPEG-4. In most of

sprite generators, the perspective transformation is selected since the sprite is usually

generated first before coding the video sequence itself and real-time processing is not

(20)

6

1.2.1.2 Parameter estimation

By using the selected transformation, the global motion estimation can be converted to a

minimization problem: ) ,' , ( min arg * P P E I I T P = (1.3)

where I and I' are the current frame and the reference image respectively. TP is the

transformation function with global motion parameters P. P* is the estimated parameters.

E(I,I',TP) is an error function defined by user. The estimation registers pixels in the current

frame into the reference image by finding the parameters which minimize the error between

the current frame and the reference image. In this dissertation, the squared error is chosen as

the error function, that is,

(

)

∑

∈ − = I y x P P I x y I T x y T I I E , 2 )) , ( ( ' ) , ( ) ,' , ( (1.4)

Due to the complication of global motion parameters, Least-Mean-Square minimization

method is used to find the parameters P*. For example, the Levenberg-Marquardt algorithm

[15] based on the gradient descent method can be used. Gradient descent based methods

search for better parameters around the current parameters, and refine the parameters

iteratively. Thus they have a risk of being trapped into a local minimum. A starting point

called an ‘initial guess’ must be provided. A good initial guess can reduce the risk of being

trapped into a local minimum and can also speed up the refinement process.

(21)

7

processed in two stages [18]. To generate a sprite from a video sequence, the first frame in the

sequence is copied into the sprite directly. Then camera motions between the following frames

and the first frame must be estimated by the global motion estimation. In the beginning of the

sequence, it is easy to find a good initial guess because the camera does not move too far. It

becomes hard to find a good initial guess when the camera motion of the current frame

relative to the current sprite is large. This problem is solved by a two-stage GME schema

shown in Fig. 1.2. The first stage estimates the motion parameters called the local parameters

between the current frame and its previous frame, i.e., the frame before the current frame.

Finding an initial guess of the local parameters is easy because the variation of camera motion

between two successive frames is small. Based on the estimated local parameters, the initial

guess of global motion parameters can be computed in the second stage by combining the

local parameters and the global motion parameters of the previous frame. Then the gradient

descent method is employed to estimate the global motion parameters.

(22)

8

1.2.1.3 Feature points selection

The iterative minimization of the gradient descent method is time consuming. To reduce

the time complexity, only some selected feature points in the current frame are employed

while computing the registration error [18]. In order to avoid the aperture problem [29], the

Hessian value [30], defined by

⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − ⋅ = 2 2 2 2 2 2 ₍ _, ₎ ₍ _, ₎ ₍ _, ₎ ) , ( dxdy y x I d dy y x I d dx y x I d y x H (1.5)

is employed to find feature pixels in many previous researches [18-21, 23-24]. Those points

with Hessian values being local maximum or minimum are considered as feature points. An

example of using Hessian value to extract feature points is shown in Fig. 1.3. The grayscale of

each pixel in Fig. 1.3(b) represents the absolute Hessian value of the corresponding pixel in

Fig. 1.3(a).

(a) (b)

(23)

9

Except for speeding up the iterative minimization, feature points can be carefully

selected to avoid pixels of moving objects from affecting the parameter estimation. If

segmentation masks are provided by the user, it is simple to exclude the pixels of moving

objects according to the information in the masks. However, if the segmentation masks are not

provided, a sprite generator should detects pixels of moving objects and excludes them

automatically.

1.2.1.4 Feature points and minimization

Each feature point (x,y) in the current frame and its corresponding point (x',y') in the

reference image form a feature point pair, which are used to find the initial guess for camera

motion. The corresponding point is defined to be the motion-estimated point of the feature

point, i.e., (x',y')=(x+dx,y+dy), where (dx,dy) is the motion vector.

As mentioned previously, the perspective transformation expressed in Eq. (1.2) has eight

parameters m1,m2,…m8. By substituting each pair of (x,y) and (x',y') into Eq. (1.2) respectively,

two equations will be built. Thus, four feature point pairs are sufficient to solve the eight

parameters. However, in practical, the corresponding points found by motion estimation are

not precise enough to provide a correct solution. Instead of using four feature point pairs, all

(24)

10

Mean-Square-Error minimization method is employed here. The error function required in the

gradient descent method, is slightly different from Eq. (1.4). Only the errors of the feature

points are counted in the error function, that is,

(

)

∑

∈ − = ints feature_po ) , ( 2 )) , ( ( ' ) , ( ) ,' , ( y x P P I x y I T x y T I I E . (1.6)

The gradient descent method is applied in the estimation of the local parameters and the

global parameters. While estimating the local parameters, the reference image is defined as

the previous frame. And the reference image is defined as the current sprite in the case of

estimating the global parameters.

1.2.2 Warping

The current frame is warped toward the sprite coordinate using the camera model

selected and the parameters estimated in the global motion estimation. Let I be the current

frame, the warped frame IW can be found by geometric transforming the current frame as

)) , ( ( ) , (x y I T x y I_W = _F _P (1.7)

where T is the transformation and P is the estimated global motion parameters. Since the

transformed coordinates are not integers, bilinear interpolation [31] is applied while

generating the warped frame. Since the frame is warped toward the sprite coordinate, the

(25)

11

1.2.3 Blending

The warped frame is blended into the current sprite. The simplest blending method is the

averaging blending. Let X, SC and SU are the intensities of the current frame, the current sprite

and the updated sprite respectively. The averaging blending can be expressed as:

1 * + + = C C C U _N X S N S (1.8)

where NC is the number of pixels blended in the current sprite.

Since the averaging blending simply blends every pixel into the sprite, pixels of moving

objects must be excluded from attending the averaging blending. Similar to the selection of

feature points, excluding these object pixels is easy if segmentation masks are provided. If

these object pixels are not fully excluded and some object pixels are blended into the sprite,

the blended object pixels will leave some shadows in the generated sprite.

1.3 Existing Single Sprite Generators

Many sprite generators [18-20, 32-35] have been proposed. Most of them are based on a

framework provided by MPEG-4 VM [14]. Among these existing sprite generators, the

generator proposed by Smolić et al. [18] is a milestone. They proposed a hierarchal long-term

global motion estimator and a reliability-based blending strategy to generate a sprite. The

reliability-based blending tries to prevent the segmentation faults in segmentation masks from

(26)

12

1.3.1 Smolić et al.’s reliability-based blending

The traditional sprite generation methods produce the sprite by warping and averaging

all frames of the video sequence. However, pixels belonging to the foreground objects will

also be blended into the generated sprite. Using segmentation masks can resolve this problem.

However, automatically generated segmentation masks are always not perfect, and

segmentation faults always exist. These segmentation faults makes some pixels of moving

objects be blended into the sprite and makes the generated sprite blur.

Reliability-based blending is developed to recover the segmentation faults. A

segmentation mask is split into reliable, unreliable and undefined regions to form a reliability

mask. The object pixels in the segmentation mask are defined as undefined region. The

background pixels near an object pixel or the boundary of the frame are defined as unreliable

region. The rest of background pixels are defined as reliable region. Fig. 1.4 shows a

segmentation mask and its reliability mask. The reliable, unreliable and undefined regions are

colored black, gray and white in Fig. 1.4(b), respectively.

Pixels in the undefined region belong to the moving objects, and definitely must not be

blended into a sprite. Pixels in the reliable region are the background pixels, which are safe to

be blended into a sprite. Pixels in the unreliable region are special. They are denoted as

(27)

13

faults, background pixels near an object pixel have a higher possibility of being wrongly

classified. Thus these pixels, which belong to the unreliable region, should be treated

carefully.

(a) (b)

Fig. 1.4 Reliability mask used in the reliability-based blending. (a) Segmentation mask. (b) Reliability mask derived from (a).

While blending a pixel into the current sprite, the reliable level is also recorded. The

reliable level of the current pixel is compared to the reliable level of the current sprite. If the

reliable level of the current pixel is lower than the sprite, the current pixel will not be blended

into the sprite and will be discarded. If the reliable levels of both pixels are identical, the

current pixel is blended into the current sprite by averaging blending. If the reliable level of

the current pixel is higher than the sprite, the sprite pixel is discarded and is replaced by the

current pixels. The discard and replace strategy ensures only pixels with highest reliable level

(28)

14

1.3.2 Watanabe and Jinzenji’s generator

Watanabe and Jinzenji [34] presented a sprite generator with two-passed blending and

automatic foreground object extraction. In the first pass of sprite blending, a provisional sprite

is constructed using the temporal median. Then the foreground objects are extracted

automatically based on the provisional sprite. A difference image of the current frame from

the provisional sprite is calculated, it is used to classify the pixels of the current frame into

foreground and background ones. Then the current frame is divided into blocks, and each

block is classified as either a foreground one or a background one according to the number of

foreground pixels inside the block. The foreground blocks are excluded in the second pass of

sprite blending. There are two disadvantages. One is that getting a perfect segmentation is

impossible due to that a good threshold is needed in block classification and that block used

as the classification unit will make segmentation roughly. The other is that the additional pass

doubles the blending time.

1.3.3 Lu et al.’s generator

Lu et al. [19-21] used a more precise segmentation method proposed by Meier and Ngan

[36] to obtain moving object masks. Based on the obtained masks, a modified

(29)

15

generate sprite. However, the qualities of generated sprites of these methods are still relying

on the precision of segmentation masks.

1.4 Geometric Distortion and Multiple sprites

The performance of a sprite generator is also limited to the perspective motion model

applied in the global motion estimation. The perspective model projects each frame of a video

sequence into a planar reference coordinate system, which is usually the coordinate system of

the first frame. Theoretically, the perspective model employed in MPEG-4 VM can cover 180

degrees of view. However, the useable viewing angle is much smaller in practice, since the

geometric distortion increases rapidly as the camera rotates away from the reference frame.

Fig. 1.5 illustrates the effect of geometric distortion. The focal length of the camera is f

and the reference imaging coordinate system for a sprite is assumed to be the first frame that

is denoted as frame A in Fig. 1.5. All the following frames must be projected to this reference

system by geometric transformation. As the camera rotates, transformed frames are

geometrically distorted, this phenomenon can be found between frame B and transformed

frame B. If camera rotation continues, from Fig. 1.5, we can see that frame C can not be

projected to the reference system.

Fig. 1.6 shows a geometric distorted frame. Frame A which shown in Fig. 1.6(a) is

(30)

16

the transformed frame B relative to frame A is shown in Fig. 1.6(c). On can see that the

transformed frame B is geometric distorted and its size is larger than the original frames. The

distortion causes the frames away from the reference frame are forced to be recorded by

extremely large resolution, but this resolution is useless because the sprite must be scaled

down to display by the decoder. This useless large resolution increases the memory usage and

storage space required to hold the sprite.

Fig. 1.5 Sprite coordinate system and geometric distortions of transformed frames.

In order to overcome the resolution-increasing effect, Massey and Bender proposed a

method using the middle frame of a video sequence as the reference frame [37]. The

generated sprite will be much symmetric and the boundary area of the generated sprite

becomes much smaller if the background of the frames in the video sequence pans toward

only one direction. On the other hand, this method only slows down the increasing effect of frame A transformed frame B on sprite

frame C

f

sprite coordinate system

camera

frame B

(31)

17

the sprite size, but the range of view angle is not extended.

(a) (b)

(c)

Fig. 1.6 Demonstration of geometric distorted frame due to camera rotation. (a) Original frame A. (b) Original frame B. (c) Transformed frame B.

Fig. 1.7 shows two sprites generated from the same portion of sequence ‘stefan’. Fig.

1.7(a) uses the rightmost frame as the reference frame, and Fig. 1.7(b) uses the middle frame

as the reference frame. One can see that the geometric distortion gets worse in the left part of

(32)

18

in Fig. 1.7(b) is evenly spread to both left and right part of the sprite. It is obviously that the

sprite using the middle frame as the reference frame is smaller than that using the rightmost

frame.

(a)

(b)

Fig. 1.7 Sprites with different reference frames.

(a) Rightmost frame as reference frame. (b) Middle frame as reference frame.

A technique using multiple sprites was proposed by Farin et al. [38] to solve the problem.

In their works, the background of a scene is stored by multiple sprites. In order to fit the

MPEG-4 standard, a video sequence is divided into several subsequences, and sprites of all

subsequences are generated independently.

(33)

19

distortion using only one sprite shown in Fig. 1.5, the geometric distortion of frame B in

sprite #2 becomes smaller. Furthermore, frame C, which is unable to be projected into sprite

#1 can be projected into sprite #2 now. Full 360 degrees of camera view can be covered if

more sprites are used. Note that any single sprite must not cover 90 degree or more of camera

rotation over any direction to prevent an effect called ‘degeneration’ [39].

Fig. 1.8 Geometric distortions using two sprites.

Farin et al. [38] have shown that using multiple sprites not only benefits the wider range

of camera view angles but also reduces storage for the generated sprites. This means that

storage required for multiple sprites is smaller than that for only one sprite. However, the

Farin et al.’s method uses exhaustive searches to find the partition points of sub-sequences

and the reference frame of each sub-sequence. The exhaustive searches make the method very frame A (and transformed A on sprite #1

transformed B on sprite #2

transformed C on sprite #2 reference coordinate of sprite #1

camera

reference coordinate of sprite #2

frame C frame B

(34)

20

time-consuming.

Some sprite generators are proposed to employ the multiple sprite technique. Chen et. al.

intergrates a frame skipping techniques and the multiple sprite to speed up the overall

computation time of sprite generation [40-41]. Kunter et al. proposed a experimental

framework [42] to employ multiple sprite into H.264/AVC. None of them is discussing to

speed up the multiple sprite partition.

1.5 Farin et al.’s Optimal Partition Algorithm

In order to find the optimal partition of a video sequence, an evaluation of partition

results must be selected. In order to reduce the computational complexity, the area of the

bounding box around a sprite is chosen to be the evaluation cost function in Farin et al.’s

work. Their optimal partition algorithm is divided into two steps. The first step computes the

minimal costs for coding sprites of all possible sub-sequences and finds the optimal reference

frames of every possible sub-sequence. The second step decides the optimal partition

positions which minimize the coding costs computed in the first step. In the following, we

will give a brief review for their optimal partition algorithm.

1.5.1 Coding Costs Computing and Reference Frames Finding

(35)

21

sub-sequences are computed in this step. For a sub-sequence beginning at frame i and ending

at frame k, a sprite r k i

S_; can be generated for a reference frame r with i≤r≤k. The coding cost for coding r

k i S_; is denoted as r k i S_; . r k i

S_; of all possible combination of i, k, and r are

computed. The number of possible combinations is huge and this computation will take a lot

of time. After computing all r k i

S_; , the optimal reference frame * ;k

i

r of a sub-sequence

beginning at frame i and ending at frame k can be selected by

r k i r k i S r* _; ; =argmin . (1.9)

The minimized coding cost of this sub-sequence, r k i r k

i S

S_; =min _; , is also kept.

The optimal reference frames for every possible sub-sequences are found and stored in

upper triangular matrixes indexed by i and k.

1.5.2 Optimal Partitioning

In order to obtain the optimal partition, not only the starting frame (partition position) of

each sub-sequence but also the number of sub-sequences must be decided. If the video is

partitioned into n sub-sequences, n-1 starting frames need to be decided since the first

sub-sequence always starts at frame 1. For a video with N frames, a partition for the video can

be represented by

)}P={(1,p₁−1),(p₁,p₂−1),(p₂,p₃−1),...,(p_n₋₁,N , (1.10) where pi is the starting frame of sub-sequence i+1. The sprite coding cost of a video sequence

(36)

22

using a partition P is the summation of sprite coding costs of all partitioned sub-sequences,

and the optimal partition P* is selected as

∑

∈ = P k i ik P S P ) , ( ; * _arg_min _{. (1.11)}

The minimization problem is solved efficiently. If the video contains only the first frame,

there is only 1 possible partition P={(1,1)} and the optimal sprite coding cost of the video is 1

; 1

S which is denoted as c1. If the video contains more than one frame, the remaining frames

are then added one by one. When adding frame k, the optimal sprite coding cost ck for the

sequence ending at frame k can be calculated as

{

}

{

}

, min arg min ; 1 ] , 1 [ ; 1 ] , 1 [ k i i k i k k i i k i k S c p S c c + = + = − ∈ − ∈ _(1.12)

where c0 is set to zero. The frame pk is the best partition point to obtain the minimal cost for

each frame k.

After calculating cN which is the minimal sprite coding cost for the entire sequence, the

optimal partition can be obtained by back tracking the stored p-values from pN. That is, the

entire sequence is best partitioned at frame pN to form two sub-sequences (1,p_N −1) and

) ,

(p_N N . The former sub-sequence is further divided at ₍ ₋₁₎

N

p

p , which is the best partition

point of sub-sequence (1,p_N −1), and so on.

(37)

23

The main problems that this dissertation tries to address are based on the

above-mentioned segmentation masks related problems and the time complexity problem of

multiple sprites generating. These will be described in the following.

In this dissertation, we will propose methods to deal with the problems of single and

multiple sprite generation. For the traditional single sprite generation, a new generation

method without using segmentation masks will be proposed to avoid the segmentation faults

from affecting the generated sprite. For the multiple sprite generation, a fast generation

method will be proposed to increase the search speed of sub-sequences and selecting of

reference frames.

1.6.1 Sprite Generation without Segmentation Masks

The framework of MPEG-4 VM shown in Fig. 1.1 requires segmentation masks while

generating sprites. Segmentation masks are binary maps indicating whether a pixel belongs to

moving objects or not. In order to avoid sprite being blurred, pixels of moving objects must be

excluded from being blended into the sprite. If the segmentation is perfect, the averaging

blending provided in MPEG-4 can achieve excellent quality; otherwise, the generated sprite

will be blurred around moving object boundary due to that some pixels of moving objects are

considered as background.

(38)

24

generating automatically during the sprite generation process. No matter how the

segmentation masks are provided, they must be generated automatically to makes the sprite

generation practical. However, it is almost impossible to generate segmentation masks

perfectly automatically. Thus segmentation faults always exist, and the generated sprite

always looks blurry.

On the purpose of reducing the blur caused by segmentation faults, precise segmentation

methods [36] are employed [34] and a new blending strategy denoted as reliability-based

blending [18] is developed. The reliability-based blending is adopted by several sprite

generators [18-21]. In the reliability-based blending strategy, a frame is divided into reliable,

unreliable, and undefined regions according to the segmented masks. Pixels denoted as

objects in the segmented masks are classified as undefined pixels, and pixels near mask

borders or frame borders (within a given distance) are classified as unreliable ones. The rest

of pixels are classified as reliable ones. The reliable and unreliable pixels are average-blended

separately, and the blended pixels with the highest reliability are chosen into the sprite. The

undefined pixels do not contribute to the sprite blending. The given distance from the mask

border must be large enough to cover all segmentation faults, or the generated sprite will have

ghost-like shadows in some places. However, it is hard to decide the distance automatically.

Thus ghost-like shadows still can be found in the generated sprite. In order to overcome this

(39)

25

segmentation masks.

Segmentation masks are not only used in the blending process, but also in the global

motion estimation process. The global motion estimation uses segmentation masks to avoid

objects affecting the accuracy of generated global motion parameters. For removing the usage

of segmentation masks completely from the entire sprite generation process, the global motion

estimation must also be modified.

1.6.2 Fast Multiple Sprites Generation

The perspective model employed in the global motion estimation process makes the

transformed frames geometric distorted, as Fig. 1.6 shows. This distortion become more

seriously as the camera view of frame away from the reference frame. Frames away from the

reference frame are forced to be recorded by extremely large resolution, but this resolution is

useless because the sprite must be scaled down to display by the decoder.

Massey and Bender proposed to use the middle frame of a video sequence as the

reference frame to overcome the resolution-increasing effect. The generated sprite will be

much symmetric and the boundary area of the generated sprite becomes much smaller if the

background of the frames in the video sequence pans toward only one direction. However,

this method only slows down the increasing effect of the sprite size, but the range of view

(40)

26

Using multiple sprites can solve this problem. Before generating multiple sprites, a video

sequence must be divided into several subsequences. Each frame in the video sequence can be

a partition position from which the video sequence is divided into subsequences. Thus, a

partition algorithm is needed. For a video sequence with N frames, there will be 2N-1 combinations of partitions. Not only the partition position but also the reference frame of each

subsequence must be selected by the partition algorithm. The selection of reference frames

greatly affects the size of generation sprites. Each frame in a subsequence can be selected as

the reference frame of the subsequence. If the sequence is divided into K subsequences, each

subsequence has Mi frames, where i is the index of a subsequence. There will be

∏

= K i i M 1 selections for reference frames. Farin et al. proposed an optimal multiple sprite partition

algorithm. A cost function representing the total area of all generated sprites was defined, and

a smart exhaustive search through the entire partition positions and reference frame

possibilities was proposed. Since the partition algorithm is an exhaustive search, it finds the

optimal solution. However, it is very time-consuming.

Apart from the geometric distortion, the effect of camera zoom-in and zoom-out will

highly affect the generated sprite and the reconstructed frames [38]. Fig. 1.9 illustrates how

camera zoom-in operation affects generated sprite and reconstructed frame. Since the

reference coordinate system is based on the reference frame, the zoomed-in frame has to be

(41)

27

forever during the down-sampling and the reconstructed frame is degraded.

Fig. 1.9 Effect of camera zoom-in with the details of the reconstructed frame lost.

In contrast to the camera zoom-in, the camera zoom-out operation makes the generated

sprite looks blur after the zoomed-out frame is blended into the sprite. As Fig. 1.10 shows, the

zoomed-out frame has to be up-sampled before blending into the sprite. This will make the

blended sprite blurred. Furthermore, the up-sampled frame will occupy a very large area in

the sprite. This causes the area of generated sprite being expanded rapidly.

In this dissertation, we try to develop a fast multiple sprites partition method and fast

reference frame selection method with acceptable total sprite areas. zoomed-in frame

reconstructed zoomed-in frame reference frame

down-sampling

(42)

28

Fig. 1.10 Effect of camera zoom-out with the sprite blurred.

1.7 Synopsis of the Dissertation

The rest of the dissertation is organized as follows. Chapter 2 describes the proposed

sprite generation method with automatically generated segmentation masks. In Chapter 3, the

proposed sprite generation method without using segmentation masks will be introduced. The

fast multiple sprite partition and reference frame selection methods are proposed in Chapter 4.

Some conclusions and future research directions are drawn in Chapter 5.

zoomed-out frame

reference frame

(43)

29

CHAPTER 2 AUTOMATIC GENERATION OF SEGMENTATION MASKS

FOR SPRITE GENERATION

In this chapter, we will propose a sprite generator with an automatic generation of

segmentation masks. The generation process contains two passes. The first pass generates a

coarse sprite by conventional averaging blending method. Then we generate segmentation

masks of every frame automatically from the coarse sprite. In the second pass of sprite

generation, the final sprite is generated with the generated segmentation masks to reduce the

effect of moving objects. The details of the proposed generator are described as follows.

2.1 Proposed Two-Pass Sprite Generation

As Fig. 1.1 shows, MPEG-4’s sprite generation framework requires auxiliary

segmentation masks to reduce the effect of moving objects. Segmentation masks are used in

global motion estimation to raise the estimation accuracy. These masks can also be used to

avoid moving objects attending the sprite blending. It is impractical to build these masks

manually. Thus an automatic generation of segmentation masks is necessary. Since the sprite

is a merged background in the video sequence, it can be used as a reference background for

(44)

30

2.1.1 Object blurring effect of averaging blending

The averaging blending used in sprite generation has a blurring effect to every pixel in

the sprite. If there does not have segmentation masks, the averaging blending blends pixels of

moving objects and background together. In case of enough frames are blended, the moving

objects will be blurred and only shadows of objects are left in a generated sprite. Fig. 2.1

demonstrates the blurring of moving objects. Figs. 2.1(a) and (b) are an original frame and its

reconstructed background from generated sprite, respectively. One can see that the moving

player is blended into the sprite, and leave white shadows in the reconstructed background.

Although the quality of reconstructed background is degraded by these shadows, the

backgrounds in the original frame are still visible in the reconstructed background.

(a) (b)

Fig. 2.1 Object blurring effect of averaging blending. (a) Original frame. (b) Reconstructed background of (a).

(45)

31

2.1.2 Frame segmentation in sprite generation

Since reconstructed backgrounds in a sprite generated by averaging blending without

segmentation masks are only blurred by moving objects, the reconstructed backgrounds can

be used as reference backgrounds to detect moving objects. A two-pass sprite generation with

automatic segmentation masks generation is proposed and shown in Fig. 2.2.

Fig. 2.2 The proposed two-pass sprite generator.

In the first pass of the proposed generator, a coarse sprite is generated first by the

MPEG-4’s sprite generator without segmentation masks. The coarse sprite will contain

shadows of moving objects definitely. Then the reconstructed backgrounds of the coarse sprite

(46)

32

segmentation masks automatically. Finally the sprite is re-generated in the second pass by the

MPEG-4’s sprite generator with the generated segmentation masks.

2.2 The First Pass of Sprite Generation

The first pass of sprite generation is almost identical to the MPEG-4’s framework. In the

global motion estimation, some feature points are extracted by selecting the pixels with larger

Hessian value. Then global motion parameters are estimated by a least mean-square-error

minimization method, the Levenberg-Marquardt algorithm. An averaging blending method is

employed as the blending method. No segmentation masks are applied in this pass of sprite

generation. Thus moving objects will be blended and the generated coarse sprite will be

blurred. The coarse sprite will be used to extract the segmentation masks automatically.

2.3 Automatic Generation of Segmentation Masks

Despite of the blurred areas, the reconstructed backgrounds still carry most of

background information. By subtracting the original frame by the reconstructed background,

we can get an image of the moving objects. In order to remove the effect of peak noise, the

block difference is applied instead of the pixel difference.

The pixel difference D is defined as the magnitude of the difference between the original

(47)

33

the candidates of object pixels. Pixels with D value larger than t1 are considered as candidates.

For each candidate, a 5×5 block B centered on the candidate is taken. The block difference DB,

is defined as

∑

∈ = B j i B D i j D ) , ( ) , ( . (2.1)

The candidate with block difference larger than a preset threshold t2 is considered as an object

pixel. The two-stage thresholding technique computes the block differences only for those

pixels with higher possibility to be objects. It reduces the complexity of computing block

difference for each pixel.

There are two problems while extracting the object pixels. First, the object regions are

often ill-shaped with holes. Second, there are some small-sized regions which are

misclassified as objects. These problems can be solved using morphological processing and

region selecting. Let O be a binary image representing the results of thresholding in the

previous step. Pixels judged as objects will be set to one and others will be set to zero. Two

binary images called seed and base images are computed. The seed image is produced by

applying morphological erosion to O using a disk shaped structure element of radius 2, and

the base image is produced by applying morphological dilation to O using the same shaped

structure element of radius 5. The region selecting is applied on the base image. An object

region is selected if any of its pixels have a value of one in the seed image. The segmentation

(48)

34

Fig. 2.3 gives an example of generating a segmentation mask. The original frame and the

reconstructed background of the frame are shown in Figs. 2.3(a) and (b) respectively. By

subtracting the original frame by the reconstructed background and performing the two-stage

thresholding, the image of the extracted object pixels is shown in Fig. 2.3(c). The seed image

shown in Fig. 2.3(d) and the base image shown in Fig. 2.3(e) are generated by applying the

morphological processing to Fig. 2.3(c). Finally, the segmentation mask is produced by

region selecting and shown in Fig. 2.3(f). The object regions are colored black.

(a) (b)

(c) (d)

Fig. 2.3 (Continued) The generation of a segmentation mask. (a) The original image. (b) The reconstructed background. (c) The object pixels extracted by using two-stage

thresholding. (d) The seed image. (e) The base image. (f) The generated segmentation mask.

(49)

35

(e) (f)

Fig. 2.3 The generation of a segmentation mask. (a) The original image. (b) The reconstructed background. (c) The object pixels extracted by using two-stage

thresholding. (d) The seed image. (e) The base image. (f) The generated segmentation mask.

Most part of the moving objects was extracted correctly, except two unclassified parts:

the upper part of bat and the player’s legs. The upper part of bat is nearly transparent hence

the background is visible through the bat; the legs of the player have similar intensities to the

background. Thus, both misclassified parts do not affect the blending result. Moreover, the top

and right borders are also classified as object; this will eliminate the black line shadows in the

generated sprite. Note that the tennis ball is also classified as an object. These generated

segmentation masks will be employed in the second pass of sprite generation.

2.4 The Second Pass of Sprite Generation

The generated segmentation masks are employed in the second pass of sprite generation.

(50)

36

automatically generated segmentation masks are employed in the global motion estimation

and the blending process. These modifications remove the effect caused by considering the

object pixels as background, and increase the fidelity of the generated sprite.

In the global motion estimation, the generated segmentation masks are employed as a

classification of object pixels. All feature points are checked with the masks. Feature points

which are classified as moving objects in the masks are removed from the feature points.

Then the global motion parameters are also estimated by the Levenberg-Marquardt algorithm.

The accuracy of estimated parameters should be increased since the effect of object pixels is

reduced.

The sprite is then blended using the newly estimated parameters. Since generated

segmentation masks are available in the second pass, the reliability-based blending is adopted

instead of the averaging blending employed in the first pass. The reliability-based blending

prevents some of moving objects that not segmented correctly from being blended into the

final sprite. The generated sprite in the second pass is outputted as the final sprite.

2.5 Experimental Results

Fig. 2.4 shows the generated sprite of the video sequence ‘stefan’ by different methods.

Fig. 2.4(a) is generated by the MPEG-4’s method without using segmentation masks, that is

(51)

37

generate Fig. 2.4(b) are obtained automatically by the proposed segmentation schema.

Fig. 2.5 shows one of the reconstructed frames by different methods respectively. Like

we stated before, the sprite generated without using masks contains shadows, which are

circled in Fig. 2.4(a), caused by wrongly blending the player into sprite. These shadows are

successfully removed in the sprite generated using the masks generated automatically by our

method. Manually segmented masks are employed in Fig. 2.4(c) and Fig. 2.5(c) for

comparisons. Both sprites generated using automatically or manually segmented masks are

perceptually the same by human eyes.

(a)

Fig. 2.4 (Continued) Generated sprites of different methods. (a) The first pass. (b) Two pass generation with automatic generated segmentation masks.

一個關於切割影片以產生單一/多重場景背景之研究

國

立

交

通

大

學

資訊科學與工程研究所

博

士

論

文

一個關於切割影片以產生單一/多重場景背景之研究

A Study on Single/Multiple Sprite Generation and Partition for Videos

研 究 生：郭萓聖

指導教授：陳玲慧 教授

一個關於切割影片以產生單一/多重場景背景之研究

A Study on Single/Multiple Sprite Generation and Partition for Videos

研 究 生：郭萓聖 Student：I-Sheng Kuo

指導教授：陳玲慧 Advisor：Ling-Hwei Chen

國 立 交 通 大 學

資 訊 科 學 與 工 程 研 究 所

博 士 論 文

一 個 關 於 切 割 影 片 以 產 生 單 一

/ 多 重 場 景 背 景 之 研 究

學生：郭萓聖

指導教授：陳玲慧 教授

國立交通大學資訊科學與工程研究所博士班

摘

要

以物件為基礎編碼的

MPEG-4 採用了一種創新的場景背景編碼方式，該

方法可以提高背景部分的編碼效率。

MPEG-4 所提出的場景背景產生系統，

使用算數平均將所有影像疊合以產生場景背景影像，然而這樣的方式會使

得產生出的場景背景影像中某些區域變得模糊，特別是曾經被移動物件佔

據過的位置。為了防止產生背景模糊的狀況，

MPEG-4 建議使用者提供一個

物件分割遮罩，標示畫面中屬於移動物件的位置，以避免物件被混入場景

背景之中。我們依照

MPEG-4 所提出的架構，建立一個場景背景產生系統。

但手動產生所有畫面的物件分割遮罩是不切實際的，因此在所建立的系統

中，我們提出一個自動化物件分割遮罩產生方法。該方法先以不使用物件

遮罩的方式產生粗糙場景背景，而後以粗糙背景為參考影像產生物件分割

遮罩，最後再以所產生的遮罩重新產生較佳的場景背景。實驗結果顯示所

提的系統產生之場景背景，具有良好的視覺品質。

自動化影像分割方法所產生的物件分割遮罩，不可能非常完美的將所有

移動物件與背景區分。未正確區分的物件分割遮罩，會使得部分移動物件

被混合入背景影像之中。導致所產生的場景背景影像中，出現如鬼影般的

移動物件殘骸。為了解決這個問題，我們將提出一個不需要物件分割遮罩

的場景背景產生系統。所提出的系統包含兩個新方法：均勻化特徵點擷取

方法與智慧型影像疊合方法。提出的特徵點擷取方法估計背景的運動向

量，利用該向量將特徵點中屬於移動物件的點予以排除。同時以均勻化的

擷取方式平均分散所有特徵點的位置。提出的智慧型影像疊合方法使用一

種計數方法，使得只有屬於背景的點被混合入場景背景影像之中。實驗結

果顯示所提出的均勻化特徵點擷取方法能有效的提高全域運動估計的準確

性，因而提高場景背景影像之品質。提出的智慧型影像疊合方法則能夠將

物件排除在疊合過程以外，使得以提出之方法產生的場景背景影像，不存

在分割失誤可能導致的鬼影現象。其視覺品質接近使用人工產生之物件分

割遮罩產生的場景背景影像，並優於

Smolic et al.提出之使用自動化物件分

割之場景背景產生方法。

場景背景產生系統中，應用了幾何轉換將非參考畫面轉換至參考畫面的

座標系統。進行幾何轉換會使轉換後畫面，以及根據轉換後畫面疊合的場

景背景影像變的歪曲。這使得場景背景影像所需要的儲存空間增加，同時

亦限制了場景背景影像所能夠涵蓋的視角。對此

Farin et al.提出了使用多重

場景背景的方式解決問題。使用多張場景背景影像所需的儲存空間總和，

有可能較使用單一場景背景影像來的小，同時亦能涵蓋較大範圍的視角。

然而

Farin et al.所提出的方法，利用暴力搜尋法找出最佳的影片分割位置。

若有

N 個畫面，這樣的方法需要 O(N

)的執行時間與 O(N

)的儲存空間。為

了降低運算的複雜度，我們提出一個快速的多重場景背景影片分割方法。

該方法包含一個可能的分割位置選取方法以及一個快速參考畫面選擇方

法。利用測量畫面之間的移動與縮放，以找出影片中有可能的分割位置。

並由這些可能的分割位置尋得最終的分割位置，將影片分割為數個子影

片，最後每一個子影片將產生一個場景背景影像。若所提出的方法找到

研究生：郭萓聖

指導教授：陳玲慧教授

研究生：郭萓聖 Student：I-Sheng Kuo

國立交通大學

資訊科學與工程研究所

博士論文

一個關於切割影片以產生單一

/ 多重場景背景之研究

指導教授：陳玲慧教授

個可能的分割位置，則所提出的方法僅需要