Bi-Layer Video Analysis – from Static Background Modeling to Dynamic Foreground Segmentation

National Chiao Tung University
Institute of Computer Science and Engineering
Doctoral Dissertation

Student: Horng-Horng Lin (林泓宏)
Advisors: Dr. Jen-Hui Chuang (莊仁輝), Dr. Tyng-Luh Liu (劉庭祿)

Bi-Layer Video Analysis – from Static Background Modeling to Dynamic Foreground Segmentation

Student: Horng-Horng Lin
Advisors: Dr. Jen-Hui Chuang, Dr. Tyng-Luh Liu

A Thesis Submitted to
Institute of Computer Science and Engineering
College of Computer Science
National Chiao Tung University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy in Computer Science

March 2011

Dedicated to my parents, Mr. 林良應 and Ms. 陳玉樺

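Abstract (translated from the Chinese original)

Bi-layer video segmentation, i.e., partitioning a video into foreground-layer and background-layer regions, is a highly challenging problem in computer vision, because the wide variation of video content complicates the separation of the two layers. In this thesis we study the problem through three research topics: background model initialization, background model maintenance, and video layer propagation. The first two topics concern the construction and maintenance of a static background model for videos captured by stationary cameras, so that the foreground layer can be segmented by background subtraction. The third topic addresses the propagation-based segmentation of foreground and background layers in dynamic videos captured by moving cameras.

For background model initialization, we develop a fast background model estimation method based on image blocks and propose a novel measure of background model completeness, so that a complete initial background model can be constructed quickly. For background model maintenance, we examine the commonly used Gaussian mixture model and find that two types of learning rate control are needed to balance two competing factors, tolerance to background changes and sensitivity of foreground detection. We therefore propose a new learning rate control scheme based on high-level information feedback to improve background model maintenance with Gaussian mixture models. For video layer propagation, we propose a dynamic layer segmentation framework based on semi-supervised spectral clustering that segments the foreground and background layers frame by frame in videos captured by moving cameras. We further extend the mathematical formulation of semi-supervised spectral clustering so that the reliability of the estimated layer labels can be regulated during segmentation, which improves segmentation accuracy. Experimental results show that this video layer propagation framework segments dynamic videos well.

Keywords: bi-layer video segmentation, background model, Gaussian mixture model, semi-supervised spectral clustering.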

Bi-Layer Video Analysis – from Static Background Modeling to Dynamic Foreground Segmentation

Student: Horng-Horng Lin

Advisor: Dr. Jen-Hui Chuang

Dr. Tyng-Luh Liu

Institute of Computer Science and Engineering

National Chiao Tung University

Abstract

Bi-layer video segmentation, i.e., the extraction of foreground regions from background ones for a video sequence, is a challenging research field in computer vision due to large content variation among video frames. To better address this bi-layer video segmentation problem, three research topics are investigated in this thesis: background model initialization, background model maintenance, and video layer propagation. While the first two topics concern static background modeling for analyzing videos obtained from static cameras, the third one pertains to dynamic foreground segmentation for videos captured by moving cameras.

For the problem of background model initialization, we propose an efficient background model estimation scheme based on image block classification, and develop a novel criterion for measuring background model completeness. For the problem of background model maintenance, we look into the formulations of Gaussian mixture modeling (GMM) and identify the need for two types of learning rates for GMM to effectively deal with a trade-off between robustness to background changes and sensitivity to foreground abnormalities. A novel bivariate learning rate control scheme for GMM based on a feedback of high-level information is also proposed. For the problem of video layer propagation, a new framework based on semi-supervised spectral clustering is proposed for dynamic foreground segmentation of a video shot captured by a moving camera. The adopted formulation of semi-supervised spectral clustering is generalized to regularize the reliabilities of layer labels in sequential propagation. Experimental results show that satisfactory results of related bi-layer video analysis can indeed be obtained with the proposed approaches.

Keywords: Bi-layer video segmentation, background modeling, Gaussian mixture modeling, semi-supervised spectral clustering.

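Acknowledgments

I have been very fortunate to learn from two advisors. In the early years of my doctoral study I was also fulfilling my national defense service at the Institute of Information Science, Academia Sinica. There Prof. Tyng-Luh Liu led me into the fields of computer vision and machine learning, taught me the fundamentals, showed me the goals and discipline of research, helped me build sound research habits, and was very tolerant of me in both research and work. Beyond my own field, Prof. Liu also gave me a first glimpse of bioinformatics. Recalling those days, I remain deeply grateful; his teaching has had a profound influence on my doctoral study and on my later work and research. After completing my service I returned to National Chiao Tung University (NCTU) as a full-time student. Prof. Jen-Hui Chuang of the NCTU Department of Computer Science trained me to do independent research, gave me great freedom in choosing research topics, and guided me step by step through developing the research, writing and revising papers, submission, and defense. In more than ten years of advising since my master's study, Prof. Chuang's guidance has also extended beyond professional matters to my life and family, with much wise advice that smoothed both my research and my life. The effort he has devoted to me over the years is beyond measure. Now that my doctoral study is drawing to a close, I feel how fortunate I am to have had two such sincere and devoted teachers guide me for so many years. Without them I would not have taken the path of research, would not have known its joys, and certainly would not have completed this degree. My gratitude to both of them is beyond words.

Besides my advisors, I especially thank Prof. 蔡文祥 of the NCTU Department of Computer Science for his guidance and encouragement while I carried out research projects, and Prof. 陳祝嵩 of the Institute of Information Science, Academia Sinica, for his care and help. I am also very grateful to my colleagues at the Institute of Information Science, 陳煥宗, 林彥宇, 張天龍, 趙盈勝, 陳俊宏, 蔡玉寶, 張文彥, and 葉士良, for their research discussions, suggestions, and encouragement, and to my fellow students at NCTU, 高肇宏, 吳至仁, 羅國華, 林哲寬, 陳宇欣, 吳思慧, 邱郁婷, 陳光兆, 劉怡伶, 李宗穎, 蔡易達, and 吳佳昱, for their help with my research. They enriched my research journey. I also sincerely thank Manager 黃哲文 and General Manager 張明智 of QNAP Systems (威聯通科技); with their strong support I had the opportunity to combine my work at the company with my thesis research and to consolidate it into research results. The valuable suggestions given at my oral defense by the committee members, Profs. 王才沛, 王聖智, 洪一平, 范國清, 廖弘源, 賴尚宏, and 蔡文祥, will guide my future research.

I also offer my deepest thanks and respect to my parents; without their selfless devotion and nurturing I would have nothing to rely on or stand on. I thank my wife 意屏 for her support over these years, which allowed me to devote myself to research and finish my studies. Finally, I thank the many friends who helped me along the way, and offer them my heartfelt gratitude.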

List of Figures

2.1 The general idea of background model initialization
2.2 Notations for background model initialization
2.3 Flowchart of the bottom-up and top-down processes
2.4 Training images for background model initialization
2.5 Distributions of training data and training results
2.6 Results of background model initialization
2.7 Results of background block detection
2.8 Results of different parameter settings
2.9 Comparisons of [22], [54], and our approach
2.10 Tests on lighting variations
2.11 Background model initialization and tracking
3.1 Flowchart of a general-purpose surveillance system
3.2 Simulated changes of the learning rate ηt
3.3 Example of motion blur
3.4 Examples of quick and double-quick lighting changes
3.6 Comparisons of background modeling for missing object and waving hand
3.7 Comparisons of background modeling without and with using the background-type rate control
3.8 Comparisons of background adaptation to double-quick lighting change
3.9 Snapshots of the ground-truth images
3.10 Quantitative comparisons of [36], [54], and our approaches
3.11 Comparisons of scene change adaptation
3.12 Foreground detection and background modeling results for other scenarios
4.1 Illustration of the spatio-temporal neighbors of a block
4.2 Examples of kernel construction
4.3 Examples of sub-graph $G'_i$ in different representations
4.4 Simulated example on the regularization of label reliability
4.5 Quantitative evaluations of the proposed regularization of label reliability
4.6 Results of the IU experiment
4.7 Results of the IU experiment with user interventions
4.8 Quantitative evaluations for the IU experiment
4.9 Snapshots of the Mobile sequence and its ground-truth layer masks
4.10 Results of the Mobile experiment

List of Tables

2.1 Comparisons between SVMs and CGBoost
2.2 Average error rates in different threshold settings
2.3 Detection error rates with and without top-down validation
3.1 Numbers of image frames resisting background adaptation to

Chapter 1 Introduction

Bi-layer video segmentation, which involves the extraction of foreground regions from background ones for a video sequence, is a challenging research field in computer vision due to the large content variation among video frames. Understanding video contents via computer-assisted analysis, one of the main goals of intelligent video analytics for surveillance applications and multimedia search, can benefit greatly from stable and accurate video layer segmentation. In this thesis, the research problem of bi-layer video analysis for segmenting videos captured by static and moving cameras is investigated along three research directions: background model initialization, background model maintenance, and video layer propagation. While the first two directions concern static background layer modeling for analyzing video sequences obtained from static cameras, the third one addresses dynamic foreground/background layer extraction for video sequences captured by moving cameras.

1.1 Background Model Initialization

For a video sequence captured by a static camera, its foreground objects can often be efficiently extracted via background subtraction if a background model that properly describes the static background scene is given. Despite the large body of previous research on background modeling, the initialization of a stable background model for a busy scene, such as a road junction with heavy traffic, has been less discussed. In our investigation of the problem of background model initialization, a new estimation scheme that combines bottom-up and top-down information to construct a stable and complete background model is presented in Chapter 2, wherein efficient image block classification for background model construction is proposed and a novel criterion for measuring background model completeness is developed. Experimental results show that the efficient block-based processing, together with the effective model completeness measure, can derive stable background models for busy scenes and outperform the compared approaches.

1.2 Background Model Maintenance

Once a proper background model for a scene of interest has been initialized, this model needs to be maintained thereafter to catch background changes, such as environmental lighting variations, so that foreground regions can be accurately differentiated. For background model maintenance, Gaussian mixture modeling (GMM) is a popular choice due to its capability of adapting to periodic background variations. However, the effectiveness of GMM is often limited by a trade-off between statistical robustness to background changes and sensitivity to foreground abnormalities, and GMM is inefficient in managing this trade-off across various surveillance scenarios. To solve this problem, a novel bivariate learning rate control scheme for GMM based on a feedback of high-level information is proposed in Chapter 3. Experimental results show that the proposed GMM approach is superior to the compared GMM-based methods in delivering better background model adaptation results for challenging scenarios with the aforementioned trade-off.

It is worth noting that two distinct approaches are applied in this thesis to solve the problems of background model initialization and maintenance. In general, the proposed GMM approach for background model maintenance can almost always give promising performance in background adaptation and foreground detection. However, according to our study, such an approach is not suitable for background model initialization, mainly because it cannot evaluate whether a background model derived by GMM is stable and/or complete, as will be discussed in Chapter 2. On the other hand, while the proposed approach for background model initialization delivers a more stable background model than the GMM approach, it is less capable of capturing dynamic background changes, such as waving trees, in long-term model maintenance, and may result in a stable but slightly blurred background model. Therefore, we present a two-stage treatment of the general problem of static background modeling, with each stage adopting the approach that best fits the specific need mentioned above.

1.3 Video Layer Propagation

For the case of a moving camera, we investigate the problem of video layer propagation in Chapter 4, wherein the foreground and background video layers of a video shot are segmented in a sequential manner for consecutive image frames. Assume that the bi-layer image segmentation for the first video frame of a video shot is given in advance. The goal of video layer propagation is to extract the corresponding video layer segments in subsequent video frames. Except for the initial layer information, no prior assumptions or restrictions, e.g., on foreground shapes, on background models, or with respect to camera motions, are made. This general setting poses a big challenge because, for example, a background layer may be cluttered and may undergo large changes in a video shot due to camera movements. To extract time-varying video layer segments, a new framework based on semi-supervised clustering is developed. Under this framework, image blocks are used as layer propagation units to avoid costly pre-segmentation of images into super-pixels. By modeling video layer propagation between consecutive image frames as a label inference problem wherein new block labels are inferred from previously known ones, and by solving this problem via semi-supervised spectral clustering, video layers are progressively propagated. Experimental results show that the proposed video layer propagation method can effectively extract dynamic video layers, even under large, non-rigid motions.

1.4 Thesis Organization

The rest of this thesis is organized as follows. In Chapter 2, we investigate the problem of background model initialization by presenting an overview of background modeling techniques, the proposed background model initialization approach, and experimental results. Assuming that an initial background model is obtained, we discuss the problem of background model maintenance in Chapter 3 by addressing the trade-off between model robustness and sensitivity, giving a GMM-based solution for balancing the trade-off, and presenting experimental results to support the effectiveness of the proposed solution. In Chapter 4, we explore the problem of video layer propagation by presenting a survey of related literature, the proposed video layer propagation framework based on semi-supervised clustering, and some experimental results. Finally, we discuss the effectiveness of the proposed approaches for the three research topics studied in this thesis, as well as future explorations, in Chapter 5.

Chapter 2 Background Model Initialization via Classification

Efficiently constructing a scene background model is crucial for tracking techniques that rely on background subtraction. Our proposed method is motivated by criteria for what a general and reasonable background model should be, and is realized by a practical classification technique. Specifically, we consider a two-level approximation scheme that combines bottom-up and top-down information for deriving a background model in real time. The key idea of our approach is simple but effective: if a classifier can be used to determine which image blocks are part of the background, its outcomes can help to carry out appropriate block-wise updates in learning such a model. The quality of the solution is further improved by global validations of the local updates to maintain the inter-block consistency. A complete background model can then be obtained based on a measurement of model completion. To demonstrate the effectiveness of our method, various experimental results and comparisons are included.

[Figure 2.1 panels: I0, I13, I69, and the estimated background.]

Figure 2.1: The general idea of background model initialization. By performing on-line classifications and iteratively integrating the frame-wise detected background blocks of images captured with a static monocular camera, the scene background can be reliably estimated in real time.

2.1 Overview

Visual tracking systems using background subtraction often work by comparing the upcoming image frame with an estimated background model to differentiate moving foreground objects from the scene background. Hence the performance of such systems depends heavily on how the background information is modeled initially, and maintained thereafter. In this work, we aim to establish a learning approach to reliably estimate a background model even when substantial object movements are present during the initialization stage. As illustrated in Fig. 2.1, the overall idea is to efficiently identify background blocks from each image frame through on-line classifications, and to iteratively integrate these background blocks into a complete model so that a tracking process can be automatically initiated in real time. In developing such a progressive processing scheme for initializing a background model, some criteria are considered.

• Stationary scene adaptation: It is commonly agreed that stationary scenes are considered as background. Thus, in our design, when a moving object becomes stationary over a certain period of time, it will be incorporated into the background model. This yields an initial background model accommodating the most recent statistics about the background scene, e.g., a parked car or an occluded area.

• Gradual variation adaptation: The computation of a background model should take account of small variations caused by, for example, gradual illumination changes, waving trees, and faint shadows. This allows a system to reduce the false detection rate of foreground objects.

• Model completion: Depending on object movements, the number of image frames needed in estimating an initial background could vary significantly. Hence, a measurement for the availability of a background model has to be defined so that the system can immediately begin to track objects upon the completion of model initialization.

• Efficiency: A background model must give rise to efficient on-line derivations to guarantee real-time tracking performance.

The first two criteria listed above specify what kind of scene contents are considered as background. The last two describe the design requirements of a background model initialization system: it should be capable of deriving a complete background model in a progressive manner and in real time.

The proposed approach has two notable features. First, we utilize learning methods to identify background blocks. Rather than developing discrimination rules or models by hand, we adopt learning approaches to construct a background block classifier. This strategy not only provides a convenient way of defining some preferred background types from image examples, but also avoids the complicated issue of manually setting discriminating parameters, because they can be resolved by learning from the chosen data. Second, the derived background model fulfills the four criteria. To achieve efficiency, a progressive estimation scheme is developed and a fast classifier is adopted. For the model completion criterion, an effective definition is given to indicate that a complete background model is obtained and that the subsequent tracking procedures can be started. Regarding the adaptation criteria, we implement a bottom-up block updating, in either a gradual or an abrupt fashion, for capturing the background variations and scene changes, respectively.

2.1.1 Related Work

Background modeling for tracking typically involves three issues: representation, initialization, and maintenance. For example, one could represent a scene background by assuming a single Gaussian distribution for each pixel, initialize the model by estimating from an image sequence, and maintain it during tracking by updating Gaussian parameters of the background pixels. While the emphases of most previous works, including those to be described later, are mainly on representation and maintenance, the task of computing an initial background model has been somewhat neglected or otherwise simplified by not allowing large object movements throughout the initialization process, e.g., [14], [41], [64].

Background representation and maintenance

Gaussian models are perhaps the most popular representation for modeling a scene background, e.g., [6], [36], [42], [54], [64]. Their maintenance is usually carried out in the form of temporal blending to update intensity means and variances. Thus related studies often differ in the number of Gaussian distributions used for each pixel, and in the update formulas for the Gaussian parameters. In [20], Gao et al. further investigate possible errors caused by Gaussian mixture models, and then apply statistical analysis to estimate related parameters.
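As an illustration of the temporal blending mentioned above, the following minimal sketch maintains a single Gaussian per pixel. It is not taken from any of the cited papers: the blending rate `rho`, the deviation factor `k`, and the choice to skip updates for outlying observations are assumptions made only for this example.

```python
def update_gaussian(mean, var, x, rho=0.05):
    """Blend a new intensity x into a running (mean, variance) estimate.

    rho is an assumed blending (learning) rate in (0, 1).
    """
    diff = x - mean
    new_mean = mean + rho * diff                      # same as (1 - rho) * mean + rho * x
    new_var = (1.0 - rho) * var + rho * diff * diff   # blended squared deviation
    return new_mean, new_var


def is_foreground(mean, var, x, k=2.5):
    """Flag x as foreground if it lies more than k standard deviations from the mean."""
    return (x - mean) ** 2 > (k * k) * var


# Usage: feed a stream of intensities observed at one pixel.
mean, var = 100.0, 4.0
for x in [101, 99, 100, 180, 102]:
    if is_foreground(mean, var, x):
        print(x, "-> foreground, model left unchanged")
    else:
        mean, var = update_gaussian(mean, var, x)
```

Methods in this family differ mainly in how many such Gaussians are kept per pixel and in how `rho` is chosen, which is the point made in the paragraph above.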

Apart from Gaussian assumptions, Elgammal et al. [13] consider kernel smoothing for a non-parametric estimate of pixel intensity over time. In [58], Toyama et al. propose the Wallflower algorithm to address the problem of background representation and maintenance at three levels: pixel, region, and frame. Ridder et al. [47] use a Kalman-filter estimator to identify the respective pixel intensities of foreground and background from an image sequence, and to suppress false foreground pixels caused by shadow borders. In [27], a mixture of local histograms is proposed to construct a texture-based background model that is more robust to background variations, e.g., illumination changes.

Prior assumptions about the foreground, background, and shadows can be used to simplify the modeling complexity. For vehicle tracking, Friedman and Russell [19] propose three kinds of color models to classify pixels into road, shadow, and vehicle. They employ an incremental EM to learn a mixture-of-Gaussian for distinguishing the foreground and background. In [48], [59], prior knowledge at pixel level is considered in learning the model parameters of the foreground and background. Then, a high-level process based on Markov random field is performed to integrate the information from all pixels.

Background model initialization


appropriate for practical uses. Haritaoglu et al. [23], instead, compute intensity medians over time. Yet a more general framework by Stauffer and Grimson [54] is to use pixel-wise Gaussian mixtures to model a scene background. Mittal and Huttenlocher [42] later extend the Gaussian mixture idea to construct a mosaic background model from images captured using a non-stationary camera. In [58], bootstrapping for background initialization is proposed, and implemented with a pixel-level Wiener filtering.
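To make the simplest of these initialization strategies concrete, the sketch below computes a per-pixel temporal median over a short list of grayscale frames. It only illustrates the general idea of median-based initialization; the frame representation (nested lists) and the function name are assumptions, and it is not the procedure of [23].

```python
from statistics import median


def median_background(frames):
    """Per-pixel temporal median of equally sized grayscale frames (2-D lists)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[r][c] for frame in frames) for c in range(width)]
        for r in range(height)
    ]


# Usage: a bright object covers the second pixel in one of three frames.
frames = [
    [[10, 20]],
    [[12, 200]],
    [[11, 22]],
]
background = median_background(frames)
```

A median of this kind tolerates a moving object only if each pixel shows the background in more than half of the frames, which is one reason busy scenes are hard to initialize.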

Among the above-mentioned approaches, initializing a background model is viewed more or less as part of the process for background maintenance. They do not have a systematic way to measure the quality, and determine the degree of completion for such a model. Consequently, these methods often require simple initializations, or otherwise start tracking activities with unreliable background models.

For computing an explicit background model, Gutchess et al. [22] use optical flow information to choose the most likely time interval of stable intensity at each pixel. However, the quality of their derived background model depends critically on the accuracy of the pixel-wise optical flow estimations. Cucchiara et al. [9] represent a background model by pixel medians of image samples, and specifically identify moving objects, shadows, and ghosts¹ for different model updates using color and motion cues. Based on the Gaussian mixture model, Hayman and Eklundh [26] formulate a statistical scheme to derive a mosaic background model with an active camera. They consider a mixel distribution to correct the errors in background registration. In [11], De la Torre and Black apply principal component analysis (PCA) to construct the scene background from an image sequence. More recently, Monnet et al. [43] propose an incremental PCA to progressively estimate a background model and detect foreground changes. Still, these systems all lack an explicit criterion for determining whether a background initialization is completed or not, which is a crucial and practical element for a real-time tracking system.

¹ Ghosts are false foreground objects detected by subtracting an inaccurate background model from image frames.

[Figure 2.2 panels: (a) On-line image stream; (b) Estimated background model.]

Figure 2.2: Notations for background model initialization. (a) $I_t$ and $I_{t-1}$ are the image frames at time $t$ and $t-1$, and their $i$th blocks are denoted as $b_t^i$ and $b_{t-1}^i$, respectively. (b) For the background models, $\tilde{B}_t$ is a possible estimation at time $t$, while $\tilde{B}^*_{t-1}$ is the best estimation up to time $t-1$. Accordingly, their $i$th blocks are represented by $\tilde{b}_t^i$ and $\tilde{b}^{*i}_{t-1}$.

Other techniques that explore layer decompositions of a video sequence can also be used to estimate a background model. Irani and Peleg [28] explore the decompositions of dominant motions and apply them to the construction of an unoccluded background image. In [29], [17], sprite layers are derived from probabilistic mixture models, in which cues of layer appearances and motions are encoded. In [1], [2], Aguiar and Moura consider rigid motions, intensity differences, and the region rigidity for figure-ground separation, and formulate them as a penalized likelihood model that can be optimized in efficient ways. In [7], [32], [63], graph-cut-based techniques, such as [5], are applied to decompose video layers via pixel labeling, with various objective functions being optimized. Though all the layer-based approaches are capable of deriving a background model even for dynamic scenes, they often need to process a video sequence in batch, which is different from the proposed progressive scheme.

2.2 Background Model Estimation via Classification

Due to limited memory space and the requirement of real-time performance, only a small number of recent image frames are stored and referred to during the construction of a background model. Thus, an iterative estimation scheme is proposed in the following to progressively identify background blocks in image frames and to incorporate their information into a background model.

2.2.1 Iterative Estimation Scheme

To illustrate the idea of the proposed iterative estimation scheme, we begin by summarizing the notations and definitions adopted in our discussion.

• We denote the test image sequence up to time instant $t$ as $\mathcal{I}_t = \{I_1, I_2, \ldots, I_t\}$, and the most recent $\ell$ image frames as $\mathcal{I}_{t,\ell} = \{I_{t-\ell+1}, \ldots, I_{t-1}, I_t\}$. We also use $b_t^i$ to stand for the $i$th block of $I_t$, and $\mathbf{b}_{t,\ell}^i = \{b_{t-\ell+1}^i, \ldots, b_{t-1}^i, b_t^i\}$ for the set of $i$th blocks from $\mathcal{I}_{t,\ell}$ (see Fig. 2.2).

• Let $\tilde{B}_t$ be any possible background model estimation at time $t$, and $\tilde{B}^*_{t-1}$ be the estimated background model at time $t-1$. Then, the $i$th blocks of $\tilde{B}_t$ and $\tilde{B}^*_{t-1}$ are denoted as $\tilde{b}_t^i$ and $\tilde{b}^{*i}_{t-1}$, respectively.

build a binary classifier, where each $x_i$ is a fixed-size image block (or simply the extracted feature vector), and $y_i \in \{-1 \text{ (foreground)}, +1 \text{ (background)}\}$ is its label.

• With training data $D$, an optimal classifier $f^*$ can be defined as

$$f^* = \arg\max_{f} \; p(f \mid D). \tag{2.1}$$

Equation (2.1) shows that a classifier $f^*$ can be derived from a probabilistic maximum a posteriori (MAP) treatment [49]. It is thus more desirable to have not only classification labels/scores but also probabilistic outputs of $f^*$. In Sec. 2.3 we will explain that either an SVM or a boosting-with-soft-margins classifier is appropriate for delivering such probabilities. With probabilistic outputs, a threshold can then be set to adjust the classification boundary, which is useful for our background estimation. We will demonstrate this usage in Sec. 2.4.1.
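The notation and the thresholding idea above can be sketched in a few lines. The example keeps only the most recent ℓ frames, cuts a frame into fixed-size blocks, and selects the blocks whose background probability exceeds a threshold. The probabilities are supplied by hand here; in the thesis they come from the trained classifier $f^*$. The function names, the block size, and the value 0.7 for the threshold are assumptions for illustration only.

```python
from collections import deque


def split_into_blocks(frame, size):
    """Split a 2-D list into non-overlapping size x size blocks, in row-major order."""
    height, width = len(frame), len(frame[0])
    blocks = []
    for r in range(0, height - size + 1, size):
        for c in range(0, width - size + 1, size):
            blocks.append([row[c:c + size] for row in frame[r:r + size]])
    return blocks


def background_indices(probabilities, tau=0.7):
    """Return the block indices whose background probability exceeds tau (tau >= 0.5)."""
    if tau < 0.5:
        raise ValueError("tau is expected to be at least 0.5")
    return [i for i, p in enumerate(probabilities) if p > tau]


# Usage: keep only the most recent ell frames, as in the windowed sequence above.
ell = 3
recent_frames = deque(maxlen=ell)
for t in range(5):
    frame = [[t * 16 + r * 4 + c for c in range(4)] for r in range(4)]
    recent_frames.append(frame)   # older frames are dropped automatically

blocks = split_into_blocks(recent_frames[-1], size=2)
selected = background_indices([0.9, 0.6, 0.75, 0.2], tau=0.7)
```

Raising `tau` trades missed background blocks for fewer false background detections.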

The proposed iterative estimation scheme for deriving a background model consists of a bottom-up block updating process and a top-down model validation process. Fig. 2.3 gives a flowchart illustrating the interactions between the two processes. The aim of the bottom-up process is to integrate identified background blocks into a model, block by block, and to form a model candidate $\tilde{B}_t$. Then, in the top-down process, the inter-block consistency of all the updated background blocks is validated. By assuming that significant background updates often occur in groups, isolated updates that mainly result from noise are eliminated by restoring their block statistics back to the previous estimates $\tilde{b}^{*i}_{t-1}$.

[Figure 2.3 flowchart. Recoverable labels: Image Block $b_t^i$; Classifier $f^*$; Background Block Estimation; Background Maintenance $(\tilde{b}_t^i \mid b_t^i, \tilde{b}^{*i}_{t-1})$; Background Replacement $(\tilde{b}_t^i \mid b_t^i)$; Model Candidate $\tilde{B}_t$; Inter-Block Consistency Validations; Background Model $\tilde{B}^*_t$; Bottom-up; Top-down.]

Figure 2.3: Flowchart of the bottom-up and top-down processes. The flowchart depicts the interactions between the bottom-up block updating and the top-down model validation processes. While the bottom-up process handles block-wise updates of the background, the top-down one deals with inter-block consistency validations. The coupling of the two processes forms an efficient scheme for deriving a background model.

More specifically, in the bottom-up process, the image block $b_t^i$ classified as background and the previously estimated background block $\tilde{b}^{*i}_{t-1}$ act as two inputs to the background adaptation. Based on a dissimilarity measure between the current image block $b_t^i$ and the previous background block $\tilde{b}^{*i}_{t-1}$, either a maintenance step or a replacement step is invoked for a block update. The maintenance step handles the case of a small block difference, assuming it is mostly caused by gradual lighting variations or small vibrations. A new background block estimate $\tilde{b}_t^i$ can thus be computed by a weighted average of the two blocks $b_t^i$ and $\tilde{b}^{*i}_{t-1}$. On the other hand, when $b_t^i$ and $\tilde{b}^{*i}_{t-1}$ are dissimilar, implying the occurrence of an abrupt scene change, a replacement step is employed to calculate a renewed background block estimate $\tilde{b}_t^i$ that is consistent with the image block $b_t^i$.
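The maintenance/replacement decision described above can be sketched as follows. This is a simplified illustration, not the thesis's algorithm: blocks are plain 2-D lists of intensities rather than per-pixel means and variances, the dissimilarity is an assumed mean absolute difference, replacement is a plain copy, and `alpha` and `max_diff` are hypothetical parameters.

```python
def mean_abs_diff(block_a, block_b):
    """Assumed dissimilarity measure: mean absolute intensity difference."""
    count = sum(len(row) for row in block_a)
    total = sum(
        abs(x - y)
        for row_a, row_b in zip(block_a, block_b)
        for x, y in zip(row_a, row_b)
    )
    return total / count


def update_block(block, prev_bg, is_background, alpha=0.1, max_diff=20.0):
    """Return (new background block estimate, name of the step taken)."""
    if not is_background:
        return prev_bg, "keep"                        # foreground: keep previous estimate
    if mean_abs_diff(block, prev_bg) <= max_diff:
        blended = [
            [(1 - alpha) * p + alpha * x for x, p in zip(row_b, row_p)]
            for row_b, row_p in zip(block, prev_bg)
        ]
        return blended, "maintain"                    # small change: weighted average
    return [row[:] for row in block], "replace"       # abrupt change: adopt the new block


# Usage on a 1 x 2 block.
estimate, step = update_block([[100, 100]], [[90, 110]], True, alpha=0.5)
```

With `alpha=0.5` the maintained estimate is the midpoint of the old and new blocks; smaller values make the model change more slowly.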

After the above bottom-up updates, a background model candidate $\tilde{B}_t$ is obtained. A top-down validation process is then introduced to ensure the model consistency between the current candidate $\tilde{B}_t$ and the previous estimate $\tilde{B}^*_{t-1}$, by assuming that background models change smoothly. Though the checking of model consistency can be realized in various ways, we choose to implement it in a simple manner by finding the updates of isolated blocks and undoing them. Large and grouped background block updates are thus preserved in this design, since they most likely belong to significant and stable background changes, such as newly uncovered scenes or stationary objects. Through the validation process, a final background model estimation $\tilde{B}^*_t$ is derived.
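One simple way to find and undo isolated block updates is sketched below. The boolean grid marks which blocks were updated in the bottom-up pass; an update with too few updated neighbors is cleared, meaning its block would be restored to the previous estimate. The 8-neighborhood and the `min_neighbors` parameter are assumptions for this example, since the text above does not fix how isolation is tested.

```python
def undo_isolated_updates(updated, min_neighbors=1):
    """Clear updates that have fewer than min_neighbors updated 8-neighbors.

    updated: 2-D list of booleans over the block grid (True = block was updated).
    Returns a new grid; cleared entries should fall back to the previous estimate.
    """
    rows, cols = len(updated), len(updated[0])
    kept = [row[:] for row in updated]
    for r in range(rows):
        for c in range(cols):
            if not updated[r][c]:
                continue
            neighbors = sum(
                updated[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            )
            if neighbors < min_neighbors:
                kept[r][c] = False   # isolated update: undo it
    return kept


# Usage: two adjacent updates survive, the lone update in the corner is undone.
mask = [
    [True, True, False, False],
    [False, False, False, False],
    [False, False, False, True],
]
validated = undo_isolated_updates(mask)
```

Neighbor counts are taken from the original grid, so the result does not depend on the order in which blocks are visited.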

It is worth mentioning that the entire approach is linked to a MAP formulation, i.e.,

$$\tilde{B}^*_t = \arg\max_{\tilde{B}_t} \Bigg\{ \underbrace{\Bigg( \prod_{i^+} P(\tilde{b}_t^i \mid b_t^i, \tilde{b}^{*i}_{t-1}) \prod_{i^-} P(\tilde{b}_t^i \mid \tilde{b}^{*i}_{t-1}) \Bigg)}_{\text{Likelihood}} \; \underbrace{P(\tilde{B}_t \mid \tilde{B}^*_{t-1})}_{\text{Prior}} \Bigg\}, \tag{2.2}$$

where $i^+ = \{\, i \mid b_t^i \text{ is classified as a background block by } f^* \,\}$ and $i^- = \{1, \ldots, n\} - i^+$. (Assume there are $n$ blocks in an image frame.) Interested readers can find the derivation of (2.2) in the Appendix.

The connections between (2.2) and our approach are elaborated as follows. Regarding the likelihood part, the two products can be viewed as block-wise updates after the background classification. For an image block classified as background, maximizing the probability $P(\tilde{b}_t^i \mid b_t^i, \tilde{b}^{*i}_{t-1})$ implies that similarities among the background estimate $\tilde{b}_t^i$, the image block $b_t^i$, and the previous estimate $\tilde{b}^{*i}_{t-1}$ should be retained. Likewise, for a foreground block, the corresponding probability $P(\tilde{b}_t^i \mid \tilde{b}^{*i}_{t-1})$ is maximized by setting the current block estimate to the previous one. These block-wise updates correspond to the bottom-up process. The prior term indicates that model-level consistency between $\tilde{B}_t$ and $\tilde{B}^*_{t-1}$ needs to be maintained for maximizing the probability $P(\tilde{B}_t \mid \tilde{B}^*_{t-1})$. This, in turn, corresponds to the top-down model validation. However, we note that the background model $\tilde{B}^*_t$ derived by our approach is only a rough approximation to the MAP solution, since (2.2) is not exactly solved. In fact, to optimize (2.2), the underlying distributions of the probability terms would have to be further specified, and complicated optimization techniques, e.g., EM-based estimations, may need to be employed. Hence, instead of pursuing the MAP solution, our focus is on the design of a practical and efficient algorithm for background model estimation.

2.2.2 The Detailed Algorithm

Bottom-up process

We start by applying $f^*$ to each $b^i_t$ to determine its probability of being a background block. A simplified notation $P(b^i_t \mid f^*)$ will hereafter denote this probability, with the understanding that the most recent $\ell$ $i$th blocks (i.e., $b^i_{t,\ell}$) are available for calculating useful features, e.g., optical flow values, for classification. Observe that, at each time $t$, only the image blocks classified as background trigger block-wise updates that modify the background model. It is therefore preferable to have as few false positives by $f^*$ as possible. Hence we apply a strict threshold $\tau^*$, i.e., the decision boundary of $f^*$, to $P(b^i_t \mid f^*)$ such that image blocks with $P(b^i_t \mid f^*) > \tau^* \ge 0.5$ are considered background. Given this setting, there are two possible cases for a block update.

• If $b^i_t$ is not a background block, then $\tilde{b}^{*i}_t = \tilde{b}^{*i}_{t-1}$, i.e., the pixel means and variances of $\tilde{b}^{*i}_{t-1}$ are carried over unchanged.

Algorithm 1: Background model estimation via classification

Data: Process $I_t$ using $f^*$, $\tilde{B}^*_{t-1}$, and an auxiliary image $\bar{B} = \{\bar{b}^i\}$. When $t = 0$, we have $\tilde{B}^*_0 = \emptyset$, $\bar{B} = \emptyset$, and $\forall i$, age(i) = 0, counter(i) = 0, and replace(i) = false.
Result: Obtain a MAP estimate $\tilde{B}^*_t$.

begin
    $\tilde{B}^*_t$ ← $\tilde{B}^*_{t-1}$
    for image block $b^i_t \in I_t$ do
        if $b^i_t$ is a valid background block then
            if diss($b^i_t$, $\tilde{b}^{*i}_{t-1}$) ≤ δ (= $15^2$ = 225) then
                /* Maintenance */
                $\tilde{b}^{*i}_t$ ← IterativeAverage($b^i_t$, $\tilde{b}^{*i}_{t-1}$)
                age(i) ← age(i) + 1
                counter(i) ← 0
                $\bar{b}^i$ ← 0
            else
                /* Replacement */
                $\bar{b}^i$ ← $\bar{b}^i + \frac{1}{N} b^i_t$
                counter(i) ← counter(i) + 1
                if counter(i) = N then
                    $\tilde{b}^{*i}_t$ ← $\bar{b}^i$
                    age(i) ← 0
                    counter(i) ← 0
                    replace(i) ← true
        else
            age(i) ← age(i) + 1
            counter(i) ← 0
            $\bar{b}^i$ ← 0
    output $\tilde{B}^*_t$


• If $b^i_t$ is classified as a background block, we measure the dissimilarity between $b^i_t$ and $\tilde{b}^{*i}_{t-1}$ by

\[
\mathrm{diss}(b^i_t, \tilde{b}^{*i}_{t-1}) = \frac{\| b^i_t - \tilde{b}^{*i}_{t-1} \|^2}{|b^i|},
\]

where $\| b^i_t - \tilde{b}^{*i}_{t-1} \|^2$ is the sum of squared pixel intensity differences, and $|b^i|$ is the block size. Depending on the value of $\mathrm{diss}(b^i_t, \tilde{b}^{*i}_{t-1})$, either a maintenance step or a replacement step is invoked (see Algorithm 1). We apply the iterative maintenance formulas proposed in [6] to update the latest small variations into $\tilde{B}^*_t$. Notice that a block replacement in evaluating $\tilde{B}^*_t$ takes place only when the particular block has been classified as background for N consecutive frames. Indeed, the maintenance phase is designed to adapt to gradual variations, and the replacement phase is to accommodate new stationary objects.
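The two cases above, together with Algorithm 1, can be sketched for a single block as follows. This is only an illustrative Python sketch under stated assumptions: a block is a flat list of pixel intensities, only pixel means are tracked (variances are omitted), an empty estimate is treated as dissimilar to any block, and the maintenance formulas of [6] are replaced by a plain running average whose weight `ALPHA` is hypothetical. The names `BlockState` and `update_block` are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import List

DELTA = 15 ** 2  # dissimilarity threshold of Algorithm 1 (225)
N = 45           # consecutive frames required before a replacement
ALPHA = 0.05     # ASSUMPTION: blending weight of the stand-in running average


@dataclass
class BlockState:
    """Bookkeeping for one block, mirroring the variables of Algorithm 1."""
    estimate: List[float] = field(default_factory=list)  # empty = no estimate yet
    accum: List[float] = field(default_factory=list)     # auxiliary block
    age: int = 0
    counter: int = 0
    replaced: bool = False


def diss(block, estimate):
    """Sum of squared pixel differences divided by the block size."""
    return sum((p - q) ** 2 for p, q in zip(block, estimate)) / len(block)


def update_block(state, block, is_valid_background):
    """Apply one bottom-up update step to a single block."""
    zeros = [0.0] * len(block)
    if not state.accum:
        state.accum = list(zeros)
    if not is_valid_background:
        # Foreground or isolated block: the previous estimate is kept.
        state.age += 1
        state.counter = 0
        state.accum = zeros
    elif state.estimate and diss(block, state.estimate) <= DELTA:
        # Maintenance: blend the new block into the estimate
        # (stand-in for the iterative formulas of [6]).
        state.estimate = [(1 - ALPHA) * q + ALPHA * p
                          for p, q in zip(block, state.estimate)]
        state.age += 1
        state.counter = 0
        state.accum = zeros
    else:
        # Replacement: accumulate the average of N consecutive blocks.
        state.accum = [a + p / N for a, p in zip(state.accum, block)]
        state.counter += 1
        if state.counter == N:
            state.estimate = list(state.accum)
            state.age = 0
            state.counter = 0
            state.replaced = True
```

Calling `update_block` once per frame for every block reproduces the bookkeeping of `age`, `counter`, and `replace` that the completeness criterion relies on.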

Top-down process

A top-down process based on comparing $\tilde{B}_t$ with $\tilde{B}^*_{t-1}$ is employed to detect isolated block updates in the bottom-up evaluation of $\tilde{B}_t$, and to undo these updates with the statistical data from $\tilde{B}^*_{t-1}$.² Conveniently, in implementing the algorithm, the top-down process can be carried out right after the background block classifications. This yields a set of valid background blocks, none of which is isolated. Hence, the bottom-up updates over these valid blocks directly lead to the final estimate $\tilde{B}^*_t$.

² An isolated block update (either for maintenance or for replacement) has fewer than three of its 4-connected neighboring blocks being updated in the bottom-up process.
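The isolation test of the top-down process can be sketched as a filter over the grid of block classifications. The following Python fragment is a minimal illustration, not the original implementation. It assumes the classifications are given as a 2-D list of booleans, and that neighbors outside the image count as not updated (the text does not say how border blocks are handled). The function name `valid_background_mask` is hypothetical.

```python
def valid_background_mask(is_background):
    """Keep a background block only if at least three of its
    4-connected neighbours are also classified as background."""
    rows, cols = len(is_background), len(is_background[0])
    valid = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not is_background[r][c]:
                continue
            neighbours = 0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                # ASSUMPTION: out-of-image neighbours count as not updated.
                if 0 <= rr < rows and 0 <= cc < cols and is_background[rr][cc]:
                    neighbours += 1
            valid[r][c] = neighbours >= 3
    return valid
```

Under the border assumption made here, corner blocks can never be valid because they have only two neighbors; a different border policy would change that behavior.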

The background model

Having described our two-phase scheme to iteratively improve $\tilde{B}^*_t$, we are now in a position to define a meaningful and steady initial background model $\tilde{B}^*$.

Definition 1. The initial background model $\tilde{B}^*$ is said to be $\tilde{B}^*_{t^*}$ if $t^*$ is the earliest time instant satisfying the following three conditions:

(i) no block replacement has occurred for the last N image frames, i.e., in calculating $\tilde{B}^*_{t^*-N+1}, \tilde{B}^*_{t^*-N+2}, \dots, \tilde{B}^*_{t^*}$;

(ii) all image blocks in $\tilde{B}^*_{t^*}$ have been replaced at least once since t = 0; and

(iii) they are of ages at least L.

(In all our experiments, we have N = 45 and L = N + 15 = 60.)
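The three conditions of Definition 1 translate directly into a per-frame check. The sketch below is illustrative only. It assumes the per-block `age` and `replace` values of Algorithm 1 are available as lists, and that a hypothetical counter `frames_since_last_replacement` records how many frames have passed since any block was last replaced.

```python
def is_initial_model_complete(ages, replaced, frames_since_last_replacement,
                              n_frames=45, min_age=60):
    """Return True when conditions (i)-(iii) of Definition 1 hold."""
    # (i) no block replacement during the last N frames
    no_recent_replacement = frames_since_last_replacement >= n_frames
    # (ii) every block has been replaced at least once since t = 0
    all_replaced_once = all(replaced)
    # (iii) every block has an age of at least L
    all_old_enough = all(age >= min_age for age in ages)
    return no_recent_replacement and all_replaced_once and all_old_enough
```

The earliest frame at which this check succeeds plays the role of $t^*$.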

2.3 Fast Classification with Soft Margins

In this section, issues related to the feature selection and the classifier formulation are addressed for the construction of an efficient background block classifier. In the feature selection, we have chosen to use features as general as possible so that the resulting classifier can handle a broad range of image sequences. Regarding the classifier formulation, two learning methods, support vector machines (SVMs) and column generation boost (CGBoost), are explored by investigating the following two issues. First, rather than binary-value classifiers, a classifier with probability outputs is required for our application. Second, the efficiency of the resulting classifier should fulfill the demand of real-time performance.


2.3.1 Feature Selection

For our purpose, the task of training is to learn a binary classifier for identifying background blocks from a video sequence captured by a static camera. We use a two-dimensional feature vector to characterize an image block $b^i$. The first component is the average optical flow value, where we apply the Lucas-Kanade algorithm [39] to compute the flow magnitude of each pixel in $b^i$. In our implementation, it takes three image frames, $I_{t-2}$, $I_{t-1}$, and $I_t$, to calculate the flow values properly. However, we note that if a one-frame delay is allowed, slightly better optical flow estimates can be achieved by referencing $I_{t-1}$, $I_t$, and $I_{t+1}$. The second component of a feature vector is derived from the (mean) inter-frame image difference

\[
|b^i|^{-1} \sum_{(x,y) \in b^i} \left| I^{x,y}_{t-1} - I^{x,y}_t \right|.
\]

To ensure good classification results, the feature values of both dimensions are normalized into [0, 1] for training and for testing.

The two feature components are discriminative enough for our application owing to their generality and consistency in classifying background blocks of varied image sequences. We should also point out that, since the optical flow values are computed using just three consecutive image frames, a few pixels may have erratic or overly large flow values. Hence, an estimated upper-bound threshold is enforced to eliminate such errors. On the other hand, the additional cue using temporal differencing is more stable and easier to calculate, but it may fail to detect all the relevant cases. For example, the inter-frame difference may not be small in evaluating a background block that consists of slightly waving trees. In such cases, the optical flow value is more informative for capturing a background block with small motions.
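As an illustration of the second feature component, the mean inter-frame difference of one 8 × 8 block can be computed as sketched below. This is a simplified example: frames are assumed to be 2-D lists of grayscale intensities, and the division by 255 is only one possible way of normalizing into [0, 1], since the normalization actually used is not specified here. The optical flow component is not covered, and the function name is hypothetical.

```python
BLOCK = 8  # block side length in pixels


def mean_interframe_difference(prev_frame, curr_frame, block_row, block_col,
                               block=BLOCK, max_intensity=255.0):
    """Mean absolute intensity difference over one block, scaled to [0, 1]."""
    total = 0.0
    for y in range(block_row * block, (block_row + 1) * block):
        for x in range(block_col * block, (block_col + 1) * block):
            total += abs(prev_frame[y][x] - curr_frame[y][x])
    # ASSUMPTION: normalise by the maximum 8-bit intensity.
    return total / (block * block) / max_intensity
```

A block whose pixels are unchanged between the two frames yields 0, and larger values indicate stronger temporal change.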


2.3.2 SVMs with Probability Outputs

For binary classification, SVMs determine a separating hyperplane $f_S(x) = w \cdot \phi(x)$, $x \in D$, by transforming $D$ from the input space to a high-dimensional feature space through a mapping function $\phi$. The optimal hyperplane $f^*_S$ can be obtained by solving the following soft-margin optimization problem:

\[
\min_{w,\, \xi_i} \; \frac{1}{2} \|w\|^2 + C_S^+ \sum_{i^+} \xi_i + C_S^- \sum_{i^-} \xi_i \tag{2.3}
\]
\[
\text{subject to } \; y_i f_S(x_i) \ge 1 - \xi_i, \quad i = 1, \dots, m,
\]

where $\xi_i \ge 0$ are slack variables for tolerating sample noise and outliers. The two parameters $C_S^+$ and $C_S^-$ are useful when dealing with unbalanced training data. (Recall that "+" is for background image blocks and "−" for foreground image blocks.) To reduce false positives, which may lead to more serious flaws in the estimated background model than false negatives would cause, $C_S^-$ is set to four times the value of $C_S^+$ so that misclassifications of foreground blocks are penalized more heavily. In solving (2.3), we use a degree-2 polynomial kernel to yield satisfactory classification outcomes efficiently.

Probability output

We use a sigmoid model to map an SVM score into the probability of being a background block by

\[
P(x \mid f^*_S) = \frac{1}{1 + \exp(A f^*_S(x) + B)}, \tag{2.4}
\]

where the two parameters A and B can be fitted using maximum likelihood estimation from $D$. Following [45], a model-trust algorithm is applied to solve the two-parameter optimization problem. In our experiments, 65% of the training blocks are used for deriving an SVM, and the other 35% are for calibrating probability outputs. The two fitted parameters are A = −0.673724 and B = −2.359339.
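For concreteness, the sigmoid mapping with the two fitted parameters can be written as a small function. The sketch below only evaluates the mapping; fitting A and B is not shown, and the function name is hypothetical.

```python
import math

A = -0.673724  # fitted sigmoid parameters reported above
B = -2.359339


def svm_background_probability(score, a=A, b=B):
    """Map an SVM decision score to the probability of being background."""
    return 1.0 / (1.0 + math.exp(a * score + b))
```

Because A is negative, the probability increases monotonically with the SVM score.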

2.3.3 CGBoost with Probability Outputs

Among the many variants of boosting methods, AdaBoost, introduced by Freund and Schapire [16], is the most popular one for deriving an effective ensemble classifier iteratively. While AdaBoost has been proved to asymptotically achieve a maximum-margin solution, recent studies also suggest the adoption of soft-margin boosting to prevent overfitting [12], [46]. We thus employ the linear programming boosting proposed by Demiriz et al. [12] to achieve a soft-margin distribution over the training data $D$ and acquire an ensemble classifier $f_B = \sum_{j=1}^{T} \alpha_j f_j$, which comprises $T$ weak learners $f_j$ and weights $\alpha_j$. Specifically, Demiriz et al. apply a column generation method to solve the linear program part by part, and establish an iterative boosting process that is similar to AdaBoost. Note that in implementing CGBoost, the weak learners are constructed from radial basis function (RBF) networks, denoted as $h$ [46]. Each $h$ has three Gaussian hidden units, two of which are initialized for the background training data and the remaining one for the foreground training data. Let $f_j(x) = \mathrm{sign}(h_j(x))$ be the weak learner selected at the $j$th iteration of CGBoost. Then, the RBF network $h_j$ is derived by minimizing the following weighted error function

\[
E_j = \frac{1}{2} \sum_{i=1}^{m} w_i \, (h_j(x_i) - y_i)^2, \tag{2.5}
\]

where $\{w_i\}$ is the weight distribution over the training data $D$ at the $j$th iteration.
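The weighted error in (2.5) can be evaluated as sketched below for given network outputs. This is an illustrative helper only; it does not train the RBF network, and the function name is hypothetical.

```python
def weighted_squared_error(weights, outputs, labels):
    """Evaluate E_j = 1/2 * sum_i w_i * (h_j(x_i) - y_i)^2."""
    return 0.5 * sum(w * (h - y) ** 2
                     for w, h, y in zip(weights, outputs, labels))
```

Samples that carry a larger weight at the current iteration contribute proportionally more to the error that the weak learner minimizes.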

Probability output

Unlike the SVM case in (2.4), boosting scores can be linked to probabilities more conveniently. Friedman et al. [18] have proved that the AdaBoost algorithm can be viewed as a stage-wise estimation procedure for fitting an additive logistic regression model. Consequently, a logistic transfer function can be directly applied to map CGBoost scores to posterior probabilities by

\[
P(x \mid f^*_B) = \frac{1}{1 + \exp(-2 f^*_B(x))}, \tag{2.6}
\]

where the mapping in (2.6) is valid when the training data $D$ do not contain a large portion of noisy samples or outliers. For the general case, it should still yield reasonable probability values with respect to the classification results by $f^*_B$.

To summarize, both classifiers, $f^*_S$ and $f^*_B$, seek a soft-margin solution when determining a decision boundary for the training data $D$. They indeed achieve similar classification performance in our experiments. However, SVMs are generally less efficient than boosting, as the number of support vectors increases rapidly with the size of $D$. We thus prefer a CGBoost classifier for estimating an initial background model.
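The logistic mapping in (2.6) also shows how a probability threshold translates into a threshold on the boosting score: solving $P = \tau$ for the score gives $f = \frac{1}{2} \ln \frac{\tau}{1 - \tau}$. The Python sketch below illustrates both directions; the function names are hypothetical.

```python
import math


def boost_background_probability(score):
    """Logistic mapping (2.6) from a boosting score to a probability."""
    return 1.0 / (1.0 + math.exp(-2.0 * score))


def score_threshold(tau):
    """Boosting score at which the mapped probability equals tau (0 < tau < 1)."""
    return 0.5 * math.log(tau / (1.0 - tau))
```

A threshold of 0.5 on the probability corresponds to a zero score, and any stricter threshold above 0.5 moves the decision boundary to a positive score.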

2.4 Experimental Results

To demonstrate the effectiveness of our approach, we first describe how the classifiers are learned for the specific problem. We then test the algorithm with a number of image sequences on a P4 1.8GHz PC. Using the experimental results, we highlight the advantages of learning a background model by classification, and make comparisons with related works. Finally, possible future extensions to the current system are also explored.

Figure 2.4: Training images for background model initialization. Examples of collected images (top) and their binary maps of the foreground (white) and background (black) regions (bottom).

2.4.1 Classifier Training

Training data

We begin by collecting images that contain moving objects of different sizes and speeds from various indoor and outdoor image sequences captured by a static camera. These images are analyzed using a tracking algorithm (with known background models), decomposed into 8 × 8 image blocks, and then manually labeled as +1 for background blocks or −1 for foreground ones. Examples of the collected images and the detected foreground and background regions are shown in Fig. 2.4. Since we prefer the resulting classifier to accommodate small variations, image blocks from regions of faint shadows or lighting changes are labeled as background. The feature vector of an image block can be computed straightforwardly by referencing the related ℓ = 3 blocks from the respective image sequence. In total, 27,600 image blocks are collected to form the training data $D$. As shown in Fig. 2.5 (a), the features extracted from the background blocks are mostly of small values, while those extracted from the foreground blocks mostly have feature values corresponding to regions of large motion.

Classifier evaluations

The training and classification outcomes obtained by implementing the classifier with $f^*_S$ and $f^*_B$, respectively, are summarized in Table 2.1. Owing to the soft-margin property of the two classifiers, almost the same training errors are obtained. However, $f^*_B$ classifies more than 20 times faster than $f^*_S$. To visualize a derived classifier, for example $f^*_B$, the level curves of its decision scores are plotted in Fig. 2.5 (b). It can be observed that the area of positive scores is located near the lower-left corner, which is consistent with the distribution of feature values computed from the training data.

Probability thresholding

To reduce false positives, we adopt a stricter probability threshold $\tau^* = 0.6$ in setting the decision boundary of a CGBoost classifier. This value is determined through 10-fold cross validation. Table 2.2 lists the average false positive and false negative rates in cross validation with respect to different threshold settings. While false negatives mainly affect the time needed to estimate an initial background model, false positives, i.e., misclassifying foreground blocks as background, have a direct impact on the quality of the estimated background model.

Figure 2.5: Distributions of training data and training results. The axes of both panels are the normalized optical flow and the normalized inter-frame difference. (a) The distribution of the training data, with foreground samples in blue and background samples in red. The training features are normalized to values between 0 and 1. To detail the distribution of background samples, only the range from 0 to 0.5 is plotted. (b) The level curves of $f^*_B$'s decision scores, with score levels from −0.4 to 0.4.

Table 2.1: Comparisons between SVMs and CGBoost (using AdaBoost as a benchmark). Settings for all classifiers: image size 320 × 240; platform: P4-1.8GHz PC.

| Classifier | SVM $f^*_S$ | CGBoost $f^*_B$ | AdaBoost $f_A$ |
|---|---|---|---|
| Parameters | $C_S^+ = 20$, $C_S^- = 80$ | $C_B = 10/27600$ | (None) |
| Components | 4185 SVs | 33 $f_j$'s | 33 $f_j$'s |
| Error Rate* | 0.0466 | 0.0466 | 0.0507 |
| Test Speed | 0.4 fps | 9.5 fps | 9.5 fps |

* Error Rate = # of Misclassified Blocks / # of Training Blocks

Table 2.2: Average error rates of 10-fold cross validation in different threshold settings.

| $\tau^*$ | 0.5 | 0.6 | 0.7 |
|---|---|---|---|
| False Positive | 0.02912 | 0.02768 | 0.01196 |
| False Negative | 0.01877 | 0.02062 | 0.49652 |

Among the tested settings, $\tau^* = 0.6$ is preferable in that it causes fewer false positives without introducing too many false negatives.

2.4.2 Performance Evaluation

Since the CGBoost classifier is more than 20 times faster than the SVM implementation (see Table 2.1), we describe below only the experimental results obtained with the CGBoost classifier $f^*_B$. To test the generality of the proposed scheme, all the to-be-estimated scenes of the testing sequences are completely different from those of the training data. The testing sequences also contain complex motions, e.g., substantial object interactions, and varied lighting conditions, such as cloudiness.

Figure 2.6: Results of background model initialization. (a)–(e) The upper row shows image frames A000, A044, A261, A510, and A650 from sequence A, and the lower row depicts the progressive estimation results $\tilde{B}^*_0$, $\tilde{B}^*_{44}$, $\tilde{B}^*_{261}$, $\tilde{B}^*_{510}$, and $\tilde{B}^*_{650}$. The initial background model is completed at $t^* = 650$. (f)–(j) The frame subtraction results for the same five frames, obtained by referencing the derived background model $\tilde{B}^*_{650}$.

Background model initialization

We first demonstrate the efficiency of our method for an outdoor environment. The sequence A contains different types of objects, including slightly waving trees, walking people, slow and fast moving vehicles, and even a stationary bike rider. We shall use this example as a benchmark to analyze the quality of our results, the detection rates, and comparisons to other existing algorithms. As illustrated in Figs. 2.6 (a) and (b), the background model is initialized to an empty set at t = 0, and it is not until the 44th frame that stationary regions of the scene start to be incorporated into the model (due to N = 45 in our setting). Fig. 2.6 (c) shows a very slow-moving car that is falsely adapted into the background temporarily (and is eventually removed after it leaves the scene). More interesting is the scenario depicted in Figs. 2.6 (d) and (e), in which a bike rider waiting for a green traffic light has remained still long enough to become a part of the derived background model at $t^* = 650$. The system can then start to track objects via frame differencing and proper model updating. On the other hand, subtracting the model from the first $t^*$ frames reveals the complexity of how the background model is initialized. Factors such as dark shadows and waving trees can now be easily identified from the results shown in Figs. 2.6 (f)–(j).

Figure 2.7: Results of background block detection for frames (a) A020, (b) A166, (c) A261, (d) A372, and (e) A540. Row one: Image frames from sequence A. Row two: The manually labeled foreground (white) and background (black) maps. Note that the very slow-moving car in (c) that later becomes fully stationary in (d) is labeled as foreground and background, respectively. Row three: Our background block detection results. The foreground blocks in gray are identified by the top-down validation process.

Background block detection

To quantitatively evaluate the accuracy of the bottom-up block classifications and the improvement with the top-down validations, we select twenty image frames from sequence A that contain moving objects of different sizes and speeds, specular light, and shadows. We then manually label each image block of the twenty frames, which results in a set of 20061 background and 3939 foreground blocks that we use to examine the accuracy of our scheme for background block detection. In Fig. 2.7, we show results for five selected frames. Note that the gray blocks are detected as foreground through the top-down validation process. To further justify the need for a combined local and global approach, a comparison of the detection error rates with and without the top-down validation step is given in Table 2.3. Though the detection rates may vary when our system is tested in different environments, applying the top-down validation clearly yields a significant reduction in errors. In this example, false positives are reduced by about 18.744%, while false negatives increase by only 3.059%.

Table 2.3: Detection error rates with and without top-down validation.

| BG/FG Block Detection | Without top-down | With top-down | % Improvement |
|---|---|---|---|
| Detection Error Rate* | 0.04142 | 0.03779 | 8.764 % |
| False Positive Rate | 0.02246 | 0.01825 | 18.744 % |
| False Negative Rate | 0.01896 | 0.01954 | −3.059 % |

* Detection Error Rate = # of Misclassified Blocks / # of Testing Blocks

Two observations arise from the foregoing verification of the accuracy of our scheme in detecting background blocks.

• For the classifier to accommodate small variations like waving trees, it may mistakenly classify very slow-moving objects into background (see Fig. 2.6 (c) and Fig. 2.7 (c)). This is indeed a trade-off, and we resolve the issue by learning a proper decision boundary from the training data.

• Our classification scheme may suffer from the aperture problem in detecting large objects, since we use motion features to construct a general classifier (see Figs. 2.7 (a) and (c)). With the top-down validation, this problem can be alleviated to some degree. Still, a number of false positives caused by the aperture problem exist frame-wise. However, since a false positive is adapted into the background model only if it occurs at the same block for N consecutive frames, such an event rarely happens in practice (with a very low probability, e.g., around $0.01825^N$ for the example in Table 2.3).
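The percentages in Table 2.3 and the order of magnitude of $0.01825^N$ can be checked with a few lines of Python. The sketch below is only a sanity check of the arithmetic; treating the frame-wise false positives as independent events is a simplifying assumption, and the function names are hypothetical.

```python
def relative_improvement(without_top_down, with_top_down):
    """Relative error reduction in percent (negative means an increase)."""
    return (without_top_down - with_top_down) / without_top_down * 100.0


def persistent_false_positive_probability(fp_rate, n_frames):
    """Probability that a false positive recurs in n_frames consecutive frames.

    ASSUMPTION: false positives in different frames are independent.
    """
    return fp_rate ** n_frames


overall = relative_improvement(0.04142, 0.03779)          # detection error rate
false_pos = relative_improvement(0.02246, 0.01825)        # false positive rate
false_neg = relative_improvement(0.01896, 0.01954)        # false negative rate
persistent = persistent_false_positive_probability(0.01825, 45)  # N = 45
```

With N = 45, the resulting probability is vanishingly small, which is consistent with the remark that such events rarely happen in practice.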

Feature selection

In our design, two general motion cues, the inter-frame difference and the optical flow value, are adopted to discriminate background scenes. While the inter-frame difference is effective in detecting static background blocks, the optical flow value provides discriminability in classifying image blocks with small motions into gradually varying background or moving foreground. To further justify the use of the optical flow cue, additional evaluations using the inter-frame difference alone are provided. With the best setting of the difference threshold at 0.013, the training error rises from 0.0466 to 0.0528 (a 13.3% increase), and the testing error for the 20 evaluation image frames increases from 0.0378 to 0.0436 (a 15.3% increase). Hence, the benefit of incorporating the optical flow value is evident.

Parameter settings

We have experimented with N = 30, 45, 60, 75, and 90. As shown in Fig. 2.8, different values of N and L mainly affect the time needed to compute a stable initial background model: the larger the value of N, the longer it takes to complete the estimation. Except for N = 30, which is too short a time period to yield a stationary adaptation, all other settings of N lead to stable background models.

Figure 2.8: Results of different parameter settings: (a) N = 30, (b) N = 45, (c) N = 60, (d) N = 75, (e) N = 90. In each case, we show the image frame $I_{t^*}$ (above) at which our system completes its estimation of a MAP initial background model (below). Only N = 30 produces an unstable estimation, due to the violation of the stationary criterion. Different values of N and L (given in Definition 1) mainly affect the time needed to derive a stable background model. Respectively, it takes 403, 650, 678, 1001, and 1031 frames to compute the initial background models.

Comparisons of Background Model Completeness

A clear advantage of our formulation is the ability to know when a well-defined initial background model is ready to be used for tracking. We demonstrate this point by making comparisons with the popular mixture of Gaussians model [54] and the local image flow approach [22]. While the two methods are also effective for background initialization, they both lack a clear definition of what the underlying background scene is at any time instant of the estimation process. Systems based on the mixture of Gaussians work by memorizing a certain number of modes for each pixel, and then integrating the most probable modes pixel by pixel to form a background model. This is in essence a local scheme, so the overall quality of a background model is difficult to evaluate. On the other hand, the method described in [22] is designed to process a whole image sequence to output a background model. We thus need to modify the algorithm into a sequential one so that the comparisons can be done by examining the respectively derived background models frame by frame.

The first experiment is carried out with image sequence A, where the three algorithms are run until frame $t^* = 650$, at which our method completes its estimation of an initial background model. For the mixture model, we use three Gaussian distributions and a blending rate of 0.01, and initialize the background model at t = 0 to the first image frame. For the local image flow implementation, the values of $w$ and $\delta_{\max}$ are set to 30 and 15, and the background model is an empty set at t = 0. In Fig. 2.9, we show some intermediate results of ours and the corresponding background models produced by the other two methods. Due to the batch nature of the local image flow scheme, its three background models shown in Fig. 2.9 are obtained by running the algorithm three times, using the respective periods of image frames as the inputs. Overall, the results produced by ours and the mixture of Gaussians are more reliable than those of the local image flow, largely because the local flow scheme relies heavily on the estimation of optical flow directions and their accuracy. While the outcomes of the mixture of Gaussians seem to be satisfactory and similar to ours, the absence of a clear definition of when the background model is complete remains a disadvantage of the approach. Furthermore, one would expect a mixture of Gaussians method to be sensitive to lighting variations, since it works by locally combining pixel intensities. We shall further elaborate on this issue with the next experiment.

Figure 2.9: Comparisons of [22], [54], and our approach for frames (a) A000, (b) A044, (c) A261, and (d) A650. Row one: Image frames from image sequence A. Row two: The intermediate results of background estimation by our method, which completes at $t^* = 650$. Rows three and four: The results respectively derived by the local image flow approach [22] and the mixture of Gaussians method [54] at each corresponding time instant.

Our second comparison focuses on the effects of lighting changes. For the outdoor sequence B (see Fig. 2.10), the lighting condition varies rapidly due to overcast clouds. The experimental results show that our method is less sensitive to variations of this kind. Specifically, in Figs. 2.10 (e) and (f), we enlarge the sizes and enhance the contrasts of the two derived background models for a clearer view. Note that, especially in the road area, our background model estimation is clearly of better quality than the one yielded by the mixture of Gaussians. This is mostly because of our use of motion cues for identifying background blocks and the properties of the MAP background model for integrating local and global consistency. On the other hand, the mixture of Gaussians approach uses only the pixel-wise intensity information, so its performance depends critically on the variations of the intensity distribution of the background scene.

Initialization and tracking

To further illustrate the efficiency of using our proposed algorithm to estimate a background model for tracking, we show the estimations of initial background models of test sequences C and D, and some subsequent tracking results in Fig. 2.11. Below each depicted image frame $I_t$, the corresponding background model $\tilde{B}^*_t$ is plotted. In the two experiments, the estimations of the initial background model $\tilde{B}^*$
