以人類為基礎的視訊處理及其在監控上的應用


Human-based Video Processing and its Application to Surveillance

Student: Yu-Chun Lai

Advisor: Hong-Yuan Mark Liao

A Thesis Submitted to the Institute of Computer Science and Engineering, College of Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer and Information Science

January 2012

Hsinchu, Taiwan, Republic of China

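Student: Yu-Chun Lai    Advisors: Dr. Hong-Yuan Mark Liao, Dr. Cheng-Chung Lin

Institute of Computer Science and Engineering, National Chiao Tung University (Doctoral Program)

Abstract (translated from the Chinese)

In recent years, human-based video processing has been a very popular research topic, mainly because humans are usually the subjects being filmed in videos such as movies, surveillance footage, and sports videos. If the human parts of a video can be processed, the analysis of video content benefits considerably. Common human-based video processing techniques include human detection, segmentation, and motion recognition, and they can be divided into off-line processing for stored video and on-line processing for real-time environments. In this dissertation, we propose human-based video processing techniques and apply them to intelligent surveillance systems. In the first study, we propose a scene segmentation method based on background information for video that has already been recorded. We use a mosaic technique to remove the foreground information (usually the people) and try to reconstruct the occluded background. Low-level visual features are then extracted from the background information to estimate the similarity between two shots, and the shots are grouped with reference to the rules of film-making to locate the boundaries between different scenes in the video. Once the scene boundaries are found, subsequent video analysis can be simplified. In the second research topic, for real-time surveillance systems we propose [...] provide clearer images, and so it is well suited to the needs of intelligent surveillance systems. We therefore propose a linear production game (LPG) solution to control the parameters of the active pan-tilt-zoom cameras in a camera network. The non-linear function we propose lets the cameras track multiple observed targets more effectively, and by expanding the parameters and adding new linear constraints it can be converted into a linear production game. Because the optimal solution of a linear production game can be found in polynomial time, the proposed method is efficient and accurate. In the third research topic, we propose a recognition technique based on local features for the human motion recognition problem, together with a human motion recognition framework built on a local feature representation. Two different local features, one capturing the long-term trend of the motion and one capturing short-term shape variation, are extracted to describe the human motion. Finally, the AdaBoost learning method selects discriminative combinations of local features to recognize human motion.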

Student: Yu-Chun Lai    Advisors: Dr. Hong-Yuan Mark Liao, Dr. Cheng-Chung Lin

Institute of Computer Science and Engineering

National Chiao Tung University

Abstract

In recent years, human-based video processing has attracted a great deal of attention in the field of computer vision, because humans are usually the major subject of a video such as a movie, a surveillance video, or a sports video. A video processing technique based on humans can therefore provide rich information for video content analysis. Common human-based video processing tasks include human detection, human segmentation, and human motion recognition. Furthermore, according to the real-time requirements of an application, they can be categorized into off-line processing for stored video and on-line processing for a real-time environment. In this dissertation, we put our emphasis on human-based video processing and apply these techniques to an intelligent surveillance application. In the first topic, we propose a scene segmentation approach based on the analysis of background information for off-line processing. The mosaic technique is utilized [...] integrated to compute the similarity measure between two shots; moreover, the rules of film-making are used to guide the shot grouping process. After the boundaries among different scenes are detected, the subsequent video analysis can be simplified. In the second topic, we propose an active camera network reconfiguration technique for an on-line surveillance system. Since an active camera (for example, a pan, tilt, zoom camera) is able to fixate on a human subject to obtain a large view of people, it is suitable for an intelligent surveillance system. Therefore, a camera network reconfiguration solution is proposed to adjust the pan, tilt, and zoom parameters in a PTZ camera network for video surveillance applications. The non-linear objective function we propose better utilizes a network's cameras to track multiple targets. We also show that, by expanding the unknown parameters and imposing new constraints, the non-linear objective function can be converted into a linear production game (LPG) problem. Since an LPG yields an optimal solution that can be evaluated in polynomial time, the proposed method is efficient and accurate. In our third work, a human motion recognition framework based on local feature representation is proposed. A clay-based feature to describe the long-term movement trend and a motion history image (MHI) based feature to describe short-term shape [...]

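Acknowledgments

My long years as a student have finally come to an end. During this time I was looked after and helped by many people, and I would like to express my gratitude to the teachers, family, and friends who helped me. First, I thank Prof. Hong-Yuan Mark Liao (廖弘源) for providing such a good environment, which let me pursue my research without worry, and for teaching us by example what the attitude and way of thinking of a researcher should be. Had I not had the chance, as a master's student, to attend the meetings of Prof. Liao's laboratory and discover more of the pleasure of research there, I might never have taken the academic path. I also thank Prof. 陳良華, who not only helped me greatly in my coursework and research but also gave me much encouragement and advice during my studies. I thank Prof. 石勝文 for taking time out of a very busy schedule to guide me on the content of my papers and on how to write them. I thank Prof. Cheng-Chung Lin (林正中) for his help and advice on departmental matters. I also thank Prof. 莊仁輝, Prof. 林志青, Prof. 林嘉文, and Prof. 賴尚宏, who took time from their busy schedules to serve on my oral defense committee, for their guidance and suggestions on this dissertation.

I thank my parents for supporting every choice I have made. I consider myself very fortunate to be able to pursue what I want to do and to choose my own path freely. The road has not always been smooth and has often worried them, but my studies are at last complete, and I hope that the road ahead will give them no more cause for concern. I thank my elder brother for discussing research with me and for giving me a great deal of information about employment.

I thank the senior students around me. 學億 and 志文 gave me much advice and help when I had just joined the laboratory and did not know what to do, so that I could gradually get used to research life and had someone to talk to about things other than research. I thank 敦裕 for reminding me at the right moments to put more thought into the important parts of my research, which kept me from taking longer detours. I thank 祐銘 for his company throughout my studies, so that I never felt I was fighting alone; I often took up his time with discussions and requests for advice, and I am truly grateful. I thank 士韋, 易聰, and 家棟; discussions with you always sparked new ideas and perspectives. [...] have given me a great deal of help ever since, and I still needed help from the two of you right up until I left the laboratory; I am truly grateful. To the many other teachers and classmates at Academia Sinica and National Chiao Tung University, thank you for your teaching and assistance.

I was fortunate to meet such good teachers and friends during my studies; they will be a treasure for the rest of my life. Besides thanking you all, I sincerely wish each of you happiness and success in reaching your goals.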

Contents

1 Introduction
1.1 Motivation
1.2 Overview of the Dissertation
1.2.1 Scene Segmentation
1.2.2 PTZ Camera Network Reconfiguration
1.2.3 Human Motion Recognition
1.3 Dissertation Organization

2 Scene Segmentation
2.1 Introduction
2.2 Background and Motivation
2.3 Mosaic Construction
2.4 Shot Similarity Measure
2.5 Scene Extraction
2.6 Experimental Results
2.7 Concluding Remarks

3 PTZ Camera Network Reconfiguration
3.1 Introduction
3.2 Problem Formulation
3.3 The Proposed Approach
3.3.1 Linear Production Game (LPG) Problem
3.4 Simulations and A Real World Experiment

4 Human Motion Recognition
4.1 Introduction
4.2 Local Feature-based Human Motion Recognition Framework
4.3 Local Feature-based Representation
4.3.1 Whole Body Motion Alignment
4.3.2 Clay Representation of Trajectories (Long-term Feature)
4.3.3 Accumulated Moving Edge Patches (Short-term Feature)
4.4 Learning and Recognition
4.5 Experimental Results
4.5.1 Leave-One-Out Cross-Validation Strategy
4.5.2 Noisy Environment Testing: Random Discarding Features
4.5.3 Noisy Environment Testing: Long-term Stationary Discarding
4.5.4 Weizmann Dataset Testing
4.5.5 Limitations and Discussion

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

List of Figures

2.1 A frame of a baseball game sequence and its corresponding motion vectors.
2.2 Sampled frames of a "Terminator II" sequence.
2.3 The static mosaic of the "Terminator II" sequence.
2.4 Two images with different texture coarseness.
2.5 The detected edge image is partitioned into a set of 16 × 22 blocks.
2.6 Sigmoid function with parameters a = 10 and b = -5.
2.7 Finding similar color object pairs.
2.8 Computing the similarity between every two adjacent shots.
2.9 Shots are grouped into several candidate scenes.
2.10 An expanding scene and the two subsequent scenes.
2.11 Shot grouping result of "Lgerca lisa 1".
2.12 A scene extracted from "Little Voice".

3.1 The decomposition of an FOV containing $|T_j^i|$ targets into $2^{|T_j^i|} - 1$ virtual FOVs.
3.2 The simulated surveillance environment with three PTZ cameras and fifteen moving targets in Pedestrian set 1.
3.3 (a) The true number of moving objects in Pedestrian set 1 and the number observed by the exhaustive search (ES) method; (b) the normalized numbers of targets observed by the compared methods (Pedestrian set 1).
3.4 The snapshot of NOF-LPG and LOF-LPG at time instance 21.
3.6 (a) A panoramic image derived by integrating 24 images; (b) the coverage regions of the three cameras overlaid on a top view image and the panoramic image of the three cameras.
3.7 Detected targets (a single person, a group of people, a car, and noise).
3.8 The tracking results of frames 1470, 3358, and 3444 using NOF-LPG.
3.9 The average number of targets identified by the ES algorithm in the video sequences used to test the NOF-LPG (video sequence 1), LOF-LPG (video sequence 2), LIM (video sequence 3), and SONG (video sequence 4) methods.

4.1 A concept of local feature-based representation and object detection.
4.2 An example that includes two different human motions, their postures, static regions, and moving regions.
4.3 An overview of the proposed human motion recognition framework.
4.4 (a) Trajectories generated by the sampling points of the same moving articulated part; (b) generating trajectories in a real case; (c) trajectories after clustering; (d) sampling points clustered at a timestamp; (e) voting neighboring regions based on sampling points; (f) generated candidate regions at a timestamp; (g) extracted trajectories.
4.5 (a) A piece of clay deformed by sequential motion vectors in time order; (b) a 2D clay pattern produced from a trajectory (red line in the left part).
4.6 The sampling and the accumulated moving edge patch extraction process [right-most: red dots: reference points; patch center point -> patch 'centroid' point].
4.7 (a) The concept of 2D Gaussian voting process [17]; (b) a sliding window [...]
4.9 (a) A randomly discarded pattern (30% of the features dropped); (b) an example of the remaining information (50% dropped).

List of Tables

2.1 Accuracy measures for six test videos.
2.2 Performance comparison for scene extraction.

3.1 Symbol table used in Section 3.2.
3.2 Symbol table used in Section 3.3.
3.3 The required target size of different surveillance applications suggested in [1].
3.4 Symbol table used in Section 3.3.1.
3.5 The average computation time required for one iteration by the NOF-LPG, LOF-LPG, SONG, LIM, and ES methods.
3.6 The percentages of targets observed by the compared methods.
3.7 The percentages of targets observed by the compared methods in noisy pedestrian sets.
3.8 The percentages of targets observed by the compared methods in a large scale scene.
3.9 The percentages of targets observed by the compared methods in a real environment.

4.1 The confusion matrix illustrates the performance of our approach.
4.2 The recognition rates achieved when different percentages of the extracted features were dropped.
[...] are dropped.
4.4 The resultant confusion matrix using the Weizmann dataset.
4.5 The recognition results derived by non-background modeling approaches.

1 Introduction

1.1 Motivation

In recent years, human-based video processing has become a popular research topic in computer vision, because humans are usually the main subject of a video, whether it is a movie, a surveillance recording, or a sports broadcast. A video processing technique centered on humans can therefore provide rich information for video content analysis. Common human-based video processing tasks include human detection, human segmentation, and human motion recognition. Depending on the real-time requirements of the application, these tasks can be categorized as off-line or on-line processing. The techniques serve important applications such as video surveillance [2, 3], video annotation [4, 5], and human-computer interfaces [6, 7]. However, human-based video processing is difficult for two reasons. First, a conceptual human body movement is a combination of complex articulated motions of a large number of body segments, so complex motion with many degrees of freedom must be handled. Second, the information available from a single frame is limited, so a set of frames, and in particular their spatio-temporal information, must be used instead. In this dissertation, we focus on these challenging issues of human-based video processing and its application to surveillance.

1.2 Overview of the Dissertation

In this dissertation, we emphasize human-based video processing for surveillance applications. In a surveillance system, human-based video processing includes off-line processing of stored video and on-line processing in a real-time surveillance environment. We propose human-based video processing techniques for both categories. For off-line processing, we propose a scene segmentation technique that segments a video into story units according to its content. We select the scene as the story unit, following the storyboard used in film-making. Once the boundaries between scenes are detected, subsequent video analysis is simplified. For on-line processing, we focus on two important issues in a traditional closed-circuit television (CCTV) surveillance system. The first issue is how to capture a view of a person that is large enough to provide good-quality evidence. An active camera, for example a pan-tilt-zoom (PTZ) camera, is well suited to this task. However, a surveillance system usually integrates multiple cameras, because a camera network covers a wider area and reduces the number of blind spots. The issue therefore becomes a PTZ camera network reconfiguration problem: how to reconfigure the parameters of a PTZ camera network so that people are captured in the cameras' fields of view. The second issue is that a traditional CCTV surveillance system cannot react in real time, because it only records the camera contents to storage. For the system to detect events and raise alarms in real time, human motion must be recognized. We therefore propose a human motion recognition approach based on analyzing motion regions and their structure.

1.2.1 Scene Segmentation

Scene extraction is the first step toward semantic understanding of a video. It also provides improved browsing and retrieval facilities to users of video databases. This work presents an effective approach to movie scene extraction based on the analysis of background images. Our approach exploits the fact that shots belonging to one particular scene often have similar backgrounds. Although part of each video frame is covered by foreground objects, the background scene can still be reconstructed by a mosaic technique. The proposed scene extraction algorithm consists of two main components: determination of the shot similarity measure and a shot grouping process. Several low-level visual features are integrated to compute the similarity measure between two shots, and the rules of film-making guide the shot grouping process. Experimental results show that our approach is promising and outperforms some existing techniques.

1.2.2 PTZ Camera Network Reconfiguration

In this work, we propose a non-linear objective function that better utilizes a network's cameras to track multiple targets. We also show that, by expanding the unknown parameters and imposing new constraints, the non-linear objective function can be converted into a linear production game (LPG) problem. Since an LPG yields an optimal solution that can be evaluated in polynomial time, the proposed method is efficient and accurate. The results of simulations and a real-world experiment demonstrate the proposed method's potential.

1.2.3 Human Motion Recognition

In this work, a local feature-based human motion analysis framework is proposed. Local features extracted from the motion regions are used to construct the relationships among different features, which in turn generate a spatial structural feature set that represents a human motion. We propose a clay-based feature to describe the long-term movement trend and a motion history image (MHI) based feature to describe short-term shape variation. In addition, the AdaBoost approach is applied to select the best feature set for discriminating between human motions. Our experiments demonstrate that the proposed approach achieves high average recognition rates and tolerates noise.

1.3 Dissertation Organization

The remainder of this dissertation is organized as follows. Chapter 2 introduces a mosaic-based scene segmentation method. Chapter 3 proposes a linear production game solution for capturing human motion. Chapter 4 describes the proposed framework for human motion recognition in detail. Finally, Chapter 5 contains concluding remarks and future work.

2 Scene Segmentation

In this chapter, we describe the proposed framework for segmenting scenes from a movie using background information. We first introduce the scene segmentation problem and the motivation of this work. We then present our approach based on background information. Next, the segmented scenes are shown in the experimental results section. Finally, we give concluding remarks.

2.1 Introduction

The advances in low-cost mass storage devices, higher transmission rates, and improved compression techniques have led to the widespread use and availability of digital video. Video data offer users of multimedia systems a wealth of information and serve as a data source in many applications, including digital libraries, publishing, entertainment, broadcasting, and education. However, because of the large amount of data and its unstructured format, efficient access to video is not an easy task. To make the original video data in a database available for browsing and retrieval, it must be analyzed, indexed, and reorganized. Suitably organized video data have the right structure for non-linear browsing and for content-based retrieval through large amounts of data. To derive a structured [...] hierarchical structure using shots and scenes as construction units. A shot is a sequence of frames captured continuously by the same camera. The definition of a scene is less straightforward; it usually refers to a common environment shared by a group of consecutive shots [8]. For example, many consecutive shots (taken by different cameras) may share similar visual content because they are produced in the same environment, such as a meeting room or a sports field. In general, a video scene is a story unit that shows the same objects and allows one event to be fully presented. Shots in a video are analogous to words in a language in that they convey little semantic information in isolation, whereas scenes reflect the dramatic and narrative structure of a video. For instance, after watching a movie, viewers remember distinct events, and such an event can be a dialogue, an action scene, or a group of shots unified by location. Scene extraction is therefore the first step toward greater semantic understanding of video content. The objective of video scene extraction is to cluster video shots into groups such that the shots within each group are related by some common aspect.

There are various types of videos, including movies, newscasts, sitcoms (situation comedies), commercials, sports, and documentaries. Some of them, such as movies, have "story units", while others (e.g., sports) do not. In this work, we concentrate on movies. Extracting the scene structure of a video after shot boundary detection usually takes two steps. The first is to represent the visual content of each shot and to define the similarity measure between two shots. The second is to group correlated consecutive shots into scenes. A compact representation of shot content for the similarity measure remains one of the most challenging issues. In our approach, the similarity measure is based on the background information obtained through a mosaic technique. By aligning all images of a video sequence onto a common reference frame, the mosaic technique generates the static background of a video scene. Although the background image alone is not an effective content representation for some tasks, such as video retrieval, it is sufficient for scene extraction. It is also worth noting that the way shots are grouped into a scene generally depends on the type of scene under analysis and on the video genre. The scenes of a TV news report differ from those of a basketball game, a documentary, or a movie. Hence, it is important to aggregate the shots according to a model of the scene. In our approach, we exploit cinematic rules to devise a shot grouping algorithm.

2.2 Background and Motivation

Numerous methods for shot boundary detection have been proposed [9]. While shots are marked by physical boundaries, scenes are marked by semantic boundaries, so scene boundary detection is far more difficult than shot boundary detection. Current techniques for video scene extraction can be broadly classified into three categories.

The first category groups shots that are visually similar and temporally close into a scene [10–19]. In these approaches, video shots are mostly represented by a set of selected key-frames. Low-level features such as color, texture, motion, and shape are extracted directly from the key-frames, and a classic clustering algorithm or simple peak detection is then used to detect scene boundaries. The limitation of key-frame-based shot representation is that a frame taken from a shot often fails to represent the dynamic content within the shot. When a sequence of shots is considered a scene, it is often because the shots are correlated by the same environment rather than by the visual similarity of their key-frames.

The second category emphasizes the integration of audio and visual information. Various methods declare a shot boundary to be a scene boundary if the visual and audio content change simultaneously [20, 21]. However, determining scene boundaries efficiently remains difficult, since the relationship between audio segments and visual shots is complicated.

The third category exploits characteristics embedded in specific video domains, such as sports and newscasts [22–25]. Since this approach is based on application-specific models, it normally achieves high accuracy. The main drawback is that an a priori model must be constructed for each application. The modeling process is time-consuming and requires good domain knowledge and experience.

In this chapter, we present a scheme for automatic video scene extraction based on visual information only. When a scene is composed of several shots, the shots have at least one aspect in common. To measure the common aspects between two shots, we define the shot similarity using background information. Our approach is based on the following observations: (i) each frame of a video can be divided into foreground objects and background scenery; (ii) in most cases, while objects may move, appear, and disappear, the background in a scene does not change significantly. These observations suggest that it is more effective to focus on the background to detect a scene, which is a collection of shots unified by a common locale. However, in a single frame, most of the physical location information is invisible, since it is concealed by the foreground objects. One solution to this problem is to use a mosaic technique. A mosaic is a panoramic image obtained by aligning all images of a video sequence onto a common reference frame [26, 27]. The resulting mosaic is an efficient and compact representation of a collection of frames [28]. In a static mosaic, moving objects blur out or disappear, and only the stationary objects and the background are displayed. Therefore, we use a mosaic technique to reconstruct the background scene of each video frame. We then compute the color and texture distribution of all the background images in a shot to determine the shot similarity. Since our method does not depend on the content of key-frames only, it can fully exploit the spatio-temporal information contained in video sequences.

2.3 Mosaic Construction

A shot is the basic unit for video indexing. To facilitate subsequent video analysis, our system first segments the original video sequences into shots [29]. We have also proposed an algorithm that aligns the frames of a video shot to build a static mosaic of the background [30]. Here, we briefly describe the generation of mosaics and refer the reader to Ref. [30] for further details. We first need to derive the transformation between partially overlapping frames. Assuming that background motion (due to camera motion) is the dominant motion in the video, the image motion of the majority of scene points can be approximated by the following transformation model:

\[ u(x, y) = a_1 x + a_2 y + a_3, \tag{2.1} \]

\[ v(x, y) = -a_2 x + a_1 y + a_4, \tag{2.2} \]

where (u(x, y), v(x, y)) is the motion vector at image position (x, y). This model is a special case of the affine model with six parameters. However, experimental results show that our model performs better than the general affine model, because the additional constraints on the model's parameters rule out undesirable transformations that the affine model allows, such as skewing. In this step, we need the motion vectors (displacements) between successive frames. Existing approaches are based on feature matching or optical flow computation, both of which are computationally intensive. To reduce the processing time, we directly use the motion vectors encoded in the MPEG-1 [31] video stream. Fig. 2.1 shows a frame of a baseball game sequence and its corresponding motion vectors. Given the motion vectors, the image motion parameters can be estimated by a robust regression technique called least-median-squares [32]. Let the motion vectors of each frame be denoted as $\{[u_1(x_1, y_1), v_1(x_1, y_1)], \ldots, [u_n(x_n, y_n), v_n(x_n, y_n)]\}$. The least-median-squares method can be described as

\[ \min_{\hat{a}} \, \operatorname{median}_{i} \left[ (a_1 x_i + a_2 y_i + a_3 - u_i)^2 + (-a_2 x_i + a_1 y_i + a_4 - v_i)^2 \right], \]

where $\hat{a} = (a_1, a_2, a_3, a_4)$ is the parameter vector. One distinctive property of the algorithm is that it tolerates up to 50% outliers in the data set, i.e., half of the data can be arbitrary without significantly affecting the regression result. This technique can therefore robustly estimate the motion of the majority of scene points (the background) without being biased by the minority of scene points (moving objects). Once the transformations between successive frames have been determined, they can be composed to obtain the alignment between any two frames of the video sequence, and in particular between the current frame and the reference frame. In most cases, we choose the first frame of the sequence as the reference.

After all the frames have been aligned to the reference frame, the next step is to select the pixels to be put into the resulting mosaic. The gray value of each mosaic pixel is computed by applying an appropriate temporal operator to the aligned frames. The temporal average operator is effective in removing temporal noise, but moving objects appear blurred, with "ghost-like" traces in the resulting mosaic. The temporal median operator can remove moving objects and produce a static mosaic of the background, but it is computationally expensive and is an off-line process. We therefore propose a novel scheme that is both effective in deleting moving objects and feasible for the on-line creation of panoramic images. Our approach is based on the observation that each overlapping pixel of the aligned frames falls into one of two categories: background or moving object. Since background motion is the dominant motion of the video and we want to build the mosaic of the background, we select the pixel value that appears most frequently in the temporal domain. Fig. 2.2 shows some sampled frames of the "Terminator II" sequence, in which the camera undergoes large movement. Fig. 2.3 shows the static mosaic, where only the background of the scene is visible.
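As an illustration of the least-median-squares estimation described earlier in this section, the following is a minimal sketch for the four-parameter motion model. It is not the implementation used in this work: the random-sampling search, the function names (`fit_motion_params`, `squared_residuals`, `lmeds_motion`), the number of trials, and the use of NumPy are all assumptions made for the example. Each trial fits the model exactly to two randomly chosen motion vectors (two vectors give four equations for the four unknowns) and keeps the candidate with the smallest median squared residual over all vectors.

```python
import numpy as np

def fit_motion_params(x, y, u, v):
    """Least-squares fit of (a1, a2, a3, a4) for
    u = a1*x + a2*y + a3 and v = -a2*x + a1*y + a4."""
    n = len(x)
    A = np.zeros((2 * n, 4))
    # Rows for the u equations: coefficients [x, y, 1, 0].
    A[:n, 0] = x
    A[:n, 1] = y
    A[:n, 2] = 1.0
    # Rows for the v equations: coefficients [y, -x, 0, 1].
    A[n:, 0] = y
    A[n:, 1] = -x
    A[n:, 3] = 1.0
    b = np.concatenate([u, v])
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

def squared_residuals(a, x, y, u, v):
    """Squared residual of every motion vector under parameters a."""
    a1, a2, a3, a4 = a
    ru = a1 * x + a2 * y + a3 - u
    rv = -a2 * x + a1 * y + a4 - v
    return ru ** 2 + rv ** 2

def lmeds_motion(x, y, u, v, trials=200, seed=0):
    """Random-sampling search for the parameters that minimise the
    median squared residual (illustrative search strategy)."""
    x, y, u, v = (np.asarray(t, dtype=float) for t in (x, y, u, v))
    rng = np.random.default_rng(seed)
    best_a, best_med = None, np.inf
    for _ in range(trials):
        idx = rng.choice(len(x), size=2, replace=False)
        a = fit_motion_params(x[idx], y[idx], u[idx], v[idx])
        med = np.median(squared_residuals(a, x, y, u, v))
        if med < best_med:
            best_a, best_med = a, med
    return best_a, best_med
```

Because each candidate is scored by a median, a candidate fitted to two background vectors keeps a low score as long as fewer than half of the vectors lie on moving objects, which is the robustness property stated above.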
Using the background mosaic, the background scene image of each video frame is reconstructed. In the subsequent processing, we focus on these background image sequences.
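The composition of frame-to-frame transformations used to align each frame with the reference frame can be sketched with homogeneous 3 × 3 matrices. This is a hedged example: the helper names are hypothetical, and it assumes that the parameters estimated for frame t describe the displacement from a position in frame t to the corresponding position in frame t − 1, so that x' = x + u = (1 + a1)x + a2·y + a3 and y' = y + v = −a2·x + (1 + a1)y + a4. Under the opposite convention the matrices would have to be inverted before composing.

```python
import numpy as np

def motion_to_matrix(a):
    """Homogeneous matrix that maps (x, y, 1) to (x + u, y + v, 1)
    for the four-parameter motion model."""
    a1, a2, a3, a4 = a
    return np.array([[1.0 + a1, a2, a3],
                     [-a2, 1.0 + a1, a4],
                     [0.0, 0.0, 1.0]])

def compose_to_reference(params):
    """params[t - 1] holds the parameters that map frame t to frame t - 1
    (an assumed convention). Returns a list T where T[t] maps coordinates
    of frame t into frame 0, the reference frame."""
    T = [np.eye(3)]
    for a in params:
        # Map frame t to frame t - 1 first, then frame t - 1 to frame 0.
        T.append(T[-1] @ motion_to_matrix(a))
    return T
```

With the first frame as the reference, `T[t]` gives the position in the mosaic of any pixel of frame t.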


Figure 2.1: A frame of a baseball game sequence and its corresponding motion vectors.

Figure 2.2: Sampled frames of a ”Terminator II” sequence.
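The pixel selection rule of Section 2.3, which chooses for each mosaic pixel the gray value that occurs most often among the aligned frames, can be sketched as follows. The sketch assumes that the frames have already been warped onto the reference frame and stored as an integer array of shape (T, H, W), together with an optional mask marking which mosaic pixels each frame covers. The function name `temporal_mode` and the per-level count array are illustrative choices, not the implementation of [30].

```python
import numpy as np

def temporal_mode(aligned, valid=None, levels=256):
    """Most frequent gray level per pixel across aligned frames.

    aligned: (T, H, W) array of integer gray levels in [0, levels).
    valid:   optional (T, H, W) boolean mask, True where a frame
             actually covers the mosaic pixel.
    Returns (mosaic, covered): the mode image and a mask of pixels
    that were covered by at least one frame.
    """
    aligned = np.asarray(aligned)
    if valid is None:
        valid = np.ones(aligned.shape, dtype=bool)
    t, h, w = aligned.shape
    counts = np.zeros((levels, h, w), dtype=np.int32)
    rows, cols = np.indices((h, w))
    for k in range(t):
        m = valid[k]
        # Increment the counter of the observed gray level at each covered pixel.
        np.add.at(counts, (aligned[k][m], rows[m], cols[m]), 1)
    mosaic = counts.argmax(axis=0).astype(np.uint8)
    covered = counts.sum(axis=0) > 0
    return mosaic, covered
```

The loop touches one frame at a time and only updates counters, so the same counting scheme could be run as frames arrive. On real footage, gray levels would probably need to be quantized more coarsely before counting, since sensor noise rarely reproduces exactly the same value; that refinement is left out here.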

2.4 Shot Similarity Measure

This section describes the color and texture features used in our work and how they enter the formulation of the shot similarity measure. A major requirement for a shot similarity measure is a content representation that captures the common aspects or characteristics of the shot. One common method is to select one key-frame from the shot and use its image features as an abstract representation of the shot. For shots with fast-changing content, one key-frame per shot is not adequate; moreover, the content description it provides varies significantly with the key-frame selection criterion. To avoid these problems, a more feasible approach is to consider the visual content of all the frames within a shot.

Color is one of the most widely used visual features in video content analysis. Most scene extraction algorithms compare color histograms between key-frames to determine the shot similarity measure. The histogram-based approach is relatively simple to implement and provides reasonable results. However, due to its statistical nature, the color histogram cannot capture the spatial layout of each color. When the image collection is large, two images with different content are likely to have quite similar histograms. To remedy this deficiency, our approach also takes into account the distribution of each color in the spatial (image) domain.

Figure 2.3: The static mosaic of the "Terminator II" sequence.

The color histogram of an image is constructed by counting the number of pixels of each color. The main issues in constructing color histograms are the choice of color space and its quantization. The RGB color space is the most common color format for digital images, but it is not perceptually uniform. Uniform quantization of RGB space yields perceptually redundant bins and perceptual holes in the color space, so non-uniform quantization may be needed. Alternatively, the HSV color space (hue, saturation, value; the value channel is referred to as intensity below) can be chosen, since it is nearly perceptually uniform. The similarity between two colors is then determined by their proximity in HSV space. With a perceptually uniform color space, uniform quantization is appropriate. Since the human visual system is more sensitive to hue than to saturation and intensity [33], H should be quantized more finely than S and V. In our implementation, the hue is quantized into 20 bins, and the saturation and intensity are each quantized into 10 bins. This quantization provides 2000 (= 20 × 10 × 10) distinct colors (bins), and each bin with a non-zero count corresponds to a color object. Since we are interested in the whole shot rather than a single frame, only one histogram is used to count the color distribution of all background images within a shot. Each bin of the resulting histogram is then divided by the number of frames in the shot to obtain the average histogram.

Next, several spatial features are calculated to characterize the distribution of each color object in each image frame. Assume that a set of pixels $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ belongs to color object $c_i$, that $k$ is the image size, and that $m$ is the total number of 4-connected pixels in $S$. Then, we define

1. the density of distribution as f_i1= n_k,

2. the compactness of distribution as fi2= m_n, 3. the scatter as fi3= 1 n√k∑ n j=1 q (xj− xµ)2+ (yj− yµ)2,

where x_µ= (1/n) ∑ni=1xiand yµ = (1/n) ∑ni=1yi.

To define the fourth feature, the image is partitioned equally into $p$ blocks of size 16 × 16. A block is active if it contains some subset of $S$. Let the number of active blocks in the image frame be $q$; we define

4. $f_{i4} = q/p$.
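The four spatial features can be illustrated with a minimal Python sketch. The function name `spatial_features` and its arguments are hypothetical. The sketch assumes that $m$ counts the pixels of $S$ that have at least one 4-neighbor in $S$, and that partial blocks at the image border count as blocks; neither detail is stated in the text.

```python
import math

def spatial_features(pixels, width, height, block=16):
    """Return (density, compactness, scatter, active_ratio) of one color object.

    pixels: iterable of (x, y) coordinates that belong to the color object.
    """
    s = set(pixels)
    n = len(s)
    k = width * height                      # image size in pixels
    if n == 0:
        return 0.0, 0.0, 0.0, 0.0

    # m: pixels of S with at least one 4-neighbor in S (assumed reading of m).
    neighbors = ((1, 0), (-1, 0), (0, 1), (0, -1))
    m = sum(1 for (x, y) in s
            if any((x + dx, y + dy) in s for dx, dy in neighbors))

    # Scatter: mean distance to the centroid, normalized by sqrt(k).
    x_mu = sum(x for x, _ in s) / n
    y_mu = sum(y for _, y in s) / n
    scatter = sum(math.hypot(x - x_mu, y - y_mu) for x, y in s) / (n * math.sqrt(k))

    # p: number of blocks in the image; q: blocks containing a pixel of S.
    p = math.ceil(width / block) * math.ceil(height / block)
    q = len({(x // block, y // block) for x, y in s})

    return n / k, m / n, scatter, q / p

# A 2 x 2 patch in the corner of a 32 x 32 image.
print(spatial_features([(0, 0), (1, 0), (0, 1), (1, 1)], 32, 32))
```

In this toy case the object is tiny but fully connected, so the density is low while the compactness equals 1.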

After the spatial features of all images are computed, we take the average of each feature over the shot. Let $f_{i1}$, $f_{i2}$, $f_{i3}$, and $f_{i4}$ be the average feature values of a color object $c_i$ in a shot. For two color objects $c_i$ and $c_j$, the difference in the spatial distribution within a shot is defined as

Ds(ci, cj) =

1


Texture refers to visual patterns that have properties of homogeneity and that do not result from the presence of a single color or intensity. It contains important information about the structural arrangement of objects and their relationship to the surrounding environment. We define the coarseness of an image's texture in terms of the distribution density of the edges. The Canny edge detector is used to extract edges from an image. The edge locations indicate sharp intensity variation. Psychophysical experiments have shown that the human visual system is sensitive to the high-frequency regions of an image, such as edges. The detected edge image is partitioned into a set of 16 × 22 blocks. A block is textured if the number of edge points in the block is greater than a threshold (= 30 in our setting). Then, we can compute the ratio of textured blocks in each image and its average value over a shot. The texture similarity between two shots is determined by the minimum of the two average values. In Fig. 2.4, two images with different levels of texture coarseness are shown. Fig. 2.5 shows the detected edge image partitioned into a set of 16 × 22 blocks.

Histogram intersection is a popular similarity measure used for color-based image matching [34]. It yields the number of pixels that have the same color in two images. In our work, we extend this idea to a shot similarity measure. Let $A$ and $B$ be the sets of all color objects in shots $S_1$ and $S_2$, respectively. For a given $u \in A$, its similar color object in $B$ is some $v \in B$ such that $\|u - v\| < \varepsilon$, where $\|u - v\|$ denotes the Euclidean distance between $u$ and $v$ in the HSV color space, and $\varepsilon$ is a threshold (= 3 in our setting). Then, $(u, v)$ is called a similar color pair. Let $\Omega = \{(u, v) \mid (u, v) \in A \times B,\ (u, v) \text{ is a similar color pair}\}$. The shot similarity measure between $S_1$ (with the average histogram $H_1$) and $S_2$ (with the average histogram $H_2$) is defined as

$\mathrm{ShotSim}(S_1, S_2) = \frac{1}{k} \sum_{(u,v) \in \Omega} \cdots$

where $k$ is the image size; $t_1$ and $t_2$ are the average ratios of textured blocks for shots $S_1$ and $S_2$, respectively; $w_t$ is the weight of the texture feature; $D_s$ is the difference in spatial features as defined in Eq. (2.4); and $W$ is a weight function defined as

$W(x) = \frac{1}{1 + e^{a \times x + b}}$.

The weight function $W$ is the general form of the sigmoid function, which is frequently used in neural network computation [35], where $a$ and $b$ are parameters. In our work, it is used to fuse the spatial distribution information with a histogram. The construction of this weight function is motivated by the psychophysical observation that the effect of spatial distribution on human perception is progressive [36]: humans perceive significant visual variation only when the difference in spatial features is greater than a threshold. The sigmoid function has this property. In our system, we set $a = 10$ and $b = -5$. As shown in Fig. 2.6, the function's value becomes significantly small for $x > 0.75$.
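The behavior of the weight function for the stated parameters can be checked with a short Python sketch. The function name `weight` is hypothetical; the parameter values $a = 10$ and $b = -5$ are those given in the text.

```python
import math

def weight(x, a=10.0, b=-5.0):
    """Sigmoid weight W(x) = 1 / (1 + exp(a * x + b))."""
    return 1.0 / (1.0 + math.exp(a * x + b))

# W decreases from nearly 1 to nearly 0 as the spatial difference x grows.
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"W({x:.2f}) = {weight(x):.3f}")
```

With these parameters the weight passes through 0.5 at $x = 0.5$, so color pairs whose spatial difference exceeds roughly that value contribute little.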

Note that a given color object in shot $S_1$ may have more than one similar color object in shot $S_2$, as illustrated in Fig. 2.7. To avoid overlapping contributions when calculating shot similarity, after each evaluation of $\min(H_1(u), H_2(v))$, this value is subtracted from both $H_1(u)$ and $H_2(v)$.
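The subtraction step can be sketched in Python as follows. This is only the histogram-matching part; the spatial weighting and the texture term are left out. The function name `matched_mass` and the dictionary representation of the histograms are assumptions made for illustration.

```python
def matched_mass(h1, h2, similar_pairs):
    """Sum min(H1(u), H2(v)) over similar pairs, consuming counts as they are used.

    h1, h2: dicts mapping a color object to its average histogram count.
    similar_pairs: iterable of (u, v) similar color pairs.
    """
    h1, h2 = dict(h1), dict(h2)          # work on copies, keep the inputs intact
    total = 0.0
    for u, v in similar_pairs:
        c = min(h1.get(u, 0.0), h2.get(v, 0.0))
        total += c
        h1[u] = h1.get(u, 0.0) - c       # the matched amount cannot be reused
        h2[v] = h2.get(v, 0.0) - c
    return total

# Color object 'a' in shot 1 is similar to both 'x' and 'y' in shot 2.
h1 = {"a": 10.0}
h2 = {"x": 6.0, "y": 8.0}
print(matched_mass(h1, h2, [("a", "x"), ("a", "y")]))
```

Without the subtraction, the object 'a' would contribute 6 + 8 = 14, which exceeds its own count of 10; with it, the contribution is capped at 10. The result of this sketch depends on the order in which the pairs are visited.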


Figure 2.5: The detected edge image is partitioned into a set of 16 × 22 blocks.

Figure 2.6: Sigmoid function with parameters a = 10 and b = −5.

2.5 Scene Extraction

This section describes the details of applying the cinematic model to group correlated consecutive shots into one scene. Movie directors, while filming scenes, also control the pace of a film in order to sustain the viewers' interest. One important factor known to influence the pace of a movie is the Montage. Montage usually refers to a model that defines the usage of editing effects to assemble the shots into a smooth sequence in physical time and/or space and in the psychological association of ideas [37]. In order to convey an idea that has a strong resonance with viewers, Montage is widely used as the


Figure 2.7: Finding similar color object pairs.

basis to model scenes. In most situations, Montage can be simplified as a set of cinematic rules. Commonly used rules include [38, 39]:

1. Parallel rule: It is used to compose scenes involving multiple themes, where shots from different themes are shown alternately. This rule is frequently used to model interactions between two parties such as conversations, hunting, and chasing.

2. Concentration rule: It starts with long distance shot, and progressively zooms into close-up shots of the main objects. It is used to introduce the main objects and their context.

3. Enlargement rule: It is the reverse of the concentration rule. It is used to introduce the context of the current main object before switching to other objects that possibly share a similar context. Thus, it typically signals the transition to a new scene.


location, time, space, and topic.

Together, these rules can be used to model most types of scenes. We use this knowledge to develop a two-pass algorithm for scene boundary detection suitable for feature movies. The first pass of the algorithm deals with the detection of potential scene boundaries. This is achieved by computing the shot similarity between every two consecutive shots (see Fig. 2.8). Given a sequence of shots $S_1, \ldots, S_n$, if $\mathrm{ShotSim}(S_{i+1}, S_i) < T_\mu$, then a potential scene boundary is detected. The threshold $T_\mu$ is empirically set to the mean of all shot similarities, i.e.,

$T_\mu = \frac{1}{n-1} \sum_{i=1}^{n-1} \mathrm{ShotSim}(S_{i+1}, S_i)$. (2.6)
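The first pass can be sketched in a few lines of Python. The function name `potential_boundaries` is hypothetical, and the input is assumed to be the list of the $n - 1$ similarities between consecutive shots, already computed.

```python
def potential_boundaries(adjacent_sims):
    """Return (boundaries, t_mu) for the first pass.

    adjacent_sims[i] is the similarity between shot i and shot i + 1 (0-based).
    A boundary b lies between shot b - 1 and shot b.
    """
    if not adjacent_sims:
        return [], 0.0
    t_mu = sum(adjacent_sims) / len(adjacent_sims)     # mean similarity, Eq. (2.6)
    boundaries = [i + 1 for i, s in enumerate(adjacent_sims) if s < t_mu]
    return boundaries, t_mu

# Six shots, five adjacent similarities; two of them fall below the mean.
sims = [0.9, 0.8, 0.2, 0.85, 0.1]
print(potential_boundaries(sims))
```

Because the threshold is the mean of the sequence itself, any sequence of similarities that are not all equal yields at least one potential boundary.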

Any two adjacent potential scene boundaries delineate a candidate scene. Thus, all shots are grouped into a set of candidate scenes (see Fig. 2.9). The above algorithm assumes that all shots in a scene take place in the same location and share much of the same background. This is true for most of the scenes composed using the serial content rule and the parallel rule (such as a conversation in a studio). However, the algorithm tends to oversegment the more complex scenes composed using the parallel or concentration/enlargement rules. In an outdoor chasing scene (also modeled by the parallel rule), the escapee and pursuer shots are shown alternately, but the backgrounds of the two types of shots may differ. One important component of a scene defined by the concentration/enlargement rule is the close-up shot, where more than half of the frame is occupied by the foreground object. Because of the limitations of the mosaic technique, the background information cannot be recovered completely. Thus, the close-up shot will be identified as belonging to another scene. To handle such scenes, we need to merge the candidate scenes further. Let $G_1 = \{S_{11}, \ldots, S_{1m}\}$ and $G_2 = \{S_{21}, \ldots, S_{2n}\}$ be two candidate scenes consisting of $m$ and $n$ shots, respectively. The scene similarity between $G_1$ and $G_2$ is defined as

$\mathrm{SceneSim}(G_1, G_2) = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} \mathrm{ShotSim}(S_{1i}, S_{2j})$. (2.7)

Two scenes $G_1$ and $G_2$ are visually similar if $\mathrm{SceneSim}(G_1, G_2) > T_\mu - \sigma/2$, where $T_\mu$ (defined in Eq. (2.6)) and $\sigma$ are, respectively, the mean and variance of all similarity measures between every two adjacent shots. Because several simple scenes are merged into one complex scene, this threshold should be smaller than that of the first pass. Given a sequence of candidate scenes, the second pass of the scene extraction algorithm mainly consists of the following merging process:
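As an illustration, the scene similarity and the merging threshold could be computed as in the following Python sketch. The names `scene_sim` and `merge_threshold` are hypothetical, `shot_sim` stands for any shot similarity function, and $\sigma$ is taken to be the population variance, following the wording above.

```python
from statistics import mean, pvariance

def scene_sim(g1, g2, shot_sim):
    """Average shot similarity over all cross pairs of two candidate scenes."""
    return sum(shot_sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

def merge_threshold(adjacent_sims):
    """T_mu - sigma / 2, with sigma the variance of the adjacent-shot similarities."""
    return mean(adjacent_sims) - pvariance(adjacent_sims) / 2

# Toy shot similarity: shots with the same parity are treated as identical.
toy_sim = lambda a, b: 1.0 if a % 2 == b % 2 else 0.0
print(scene_sim([0, 1], [2, 3], toy_sim))
print(merge_threshold([0.2, 0.4, 0.6, 0.8]))
```

Two scenes would then be declared visually similar when `scene_sim` exceeds `merge_threshold`.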

Step 1: Set the expanding scene to be the first scene.

Step 2: Compare the expanding scene with the two subsequent scenes B and C (see Fig. 2.10).

Step 3: If the expanding scene and scene C are visually similar, then

1. merge the expanding scene, scene B and scene C into one scene;

2. set the expanding scene to be the merged scene;

3. go to Step 2.

Step 4: If the expanding scene and scene B are visually similar, then

1. merge the expanding scene and scene B into one scene;

2. set the expanding scene to be the merged scene;

3. go to Step 2.

Step 5: Set the expanding scene to be scene B and go to Step 2.

This process is repeated until no more scenes can be merged. It is noted that B and C always refer to the two scenes immediately following the current expanding scene and they are always updated in Step 2.
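The merging process in Steps 1 to 5 can be sketched in Python as a single left-to-right pass. The function name `merge_scenes` and the predicate `similar` are hypothetical; `similar(g1, g2)` stands for the visual similarity test between two scenes. The sketch assumes that a finished expanding scene is emitted when Step 5 is reached.

```python
def merge_scenes(scenes, similar):
    """Merge candidate scenes following Steps 1-5.

    scenes: list of candidate scenes, each a list of shot ids.
    similar: function (scene, scene) -> bool.
    """
    scenes = [list(s) for s in scenes]
    merged = []
    i = 0
    while i < len(scenes):
        expanding = scenes[i]                  # Step 1 (or Step 5 on later rounds)
        nxt = i + 1                            # index of scene B
        while True:                            # Step 2: look at B and C
            b = scenes[nxt] if nxt < len(scenes) else None
            c = scenes[nxt + 1] if nxt + 1 < len(scenes) else None
            if c is not None and similar(expanding, c):      # Step 3
                expanding = expanding + b + c
                nxt += 2
            elif b is not None and similar(expanding, b):    # Step 4
                expanding = expanding + b
                nxt += 1
            else:                                            # Step 5
                break
        merged.append(expanding)
        i = nxt
    return merged

# Toy example: shots carry a background label; scenes sharing a label are similar.
labels = {0: "A", 1: "B", 2: "A", 3: "C"}
same_label = lambda g1, g2: bool({labels[s] for s in g1} & {labels[s] for s in g2})
print(merge_scenes([[0], [1], [2], [3]], same_label))
```

In the toy example the second scene is absorbed because the scenes on both sides of it share a background, which mirrors the parallel-rule case of alternating shots.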


Figure 2.8: Computing the similarity between every two adjacent shots.

Figure 2.9: Shots are grouped into several candidate scenes.

2.6 Experimental Results

Six test videos in MPEG-1 format were used to evaluate our scene extraction algorithm: one home video and five full-length movies. The home video "Lgerca lisa 1" is an MPEG standard test video with ground truth from the original video provider. The genres of the movies include action, drama, comedy, thriller, and music. Testing with five different genres of movies helps ensure that the overall performance of the algorithm is not biased toward a specific kind of movie. To obtain the ground truth for the other videos, two graduate students were invited to watch the movies and were asked to mark their own scene boundaries. The intersection of their segmentations was used as the ground truth for the experiments. In movies, there is usually no concrete or clear boundary between two adjacent scenes because of editing effects. Therefore, we follow Hanjalic's evaluation criterion [12]: if a detected scene boundary is within four shots of a manually marked boundary, it is counted as a correct boundary. Basic information about the test videos and the experimental results are shown in Table 2.1. As shown in Table 2.1, our algorithm correctly extracts all six scenes of the first video (Lgerca lisa 1) without any missed


or false detection. Fig. 2.11 shows the shot grouping result of the first video, where the first frame of each shot is displayed. Fig. 2.12 shows one extracted scene (consisting of 4 shots) from another video, "LittleVoice". This scene is composed using the parallel rule and has different backgrounds in the first and second shots. Our shot grouping algorithm is able to identify both shots as belonging to the same scene. However, our algorithm has some false and missed detections in the other test videos. The false detections are due to significant changes of lighting, such as explosions and flashing lights. This could perhaps be improved by a more sophisticated visual similarity measure. The missed detections are mainly caused by an inappropriate setting of the merging threshold. As the threshold depends on the variance ($\sigma$) of all similarity measures between every two adjacent shots, too large a $\sigma$ results in under-segmentation of the video. This type of error occurs in some video scenes with a very inconsistent pace. For performance comparison, we implemented the well-known scene extraction algorithm proposed by Yeung et al. [10]. In their approach, a video sequence is first segmented into shots. Then, a time-constrained clustering algorithm is used to group visually similar and temporally adjacent shots into clusters. The visual similarity between two shots is measured by comparing the color histograms of the respective key-frames. Finally, a scene transition graph is constructed based on the clusters, and cutting edges are identified to construct the scene structure. For a fair comparison, the parameters of Yeung's algorithm were tuned to achieve the best performance. To measure the performance quantitatively, two metrics are used:

$\mathrm{recall} = \frac{D}{D + MD}, \quad \mathrm{precision} = \frac{D}{D + FD}$,

where $D$ is the number of scene boundaries detected correctly, $MD$ is the number of missed detections, and $FD$ is the number of false detections. Table 2.2 shows the performance comparison. As "Lgerca lisa 1" is an MPEG standard test video, some related works have also used it as a test video. According to the reports of Lin [14] and Ngo [15], their approaches have three and two false detections, respectively.
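As a quick check of the two metrics, the following Python sketch evaluates them for the "Dungeons & Dragons" counts reported in Table 2.1 (54 correct, 12 missed, and 6 false detections). The function name `recall_precision` is hypothetical.

```python
def recall_precision(correct, missed, false):
    """Return (recall, precision) from detection counts D, MD, and FD."""
    recall = correct / (correct + missed)
    precision = correct / (correct + false)
    return recall, precision

r, p = recall_precision(correct=54, missed=12, false=6)
print(f"recall = {r:.1%}, precision = {p:.1%}")  # recall = 81.8%, precision = 90.0%
```

The result agrees with the corresponding entry of Table 2.2.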


Table 2.1: Accuracy measures for six test videos.

| Video title | Genre | Duration (minutes) | No. of scenes (ground truth) | Correct detections | Missed detections | False detections |
|---|---|---|---|---|---|---|
| Lgerca lisa 1 | Home video | 15 | 6 | 6 | 0 | 0 |
| Dungeons & Dragons | Action | 107 | 66 | 54 | 12 | 6 |
| Little Voice | Drama | 96 | 141 | 110 | 31 | 24 |
| Hot Chick | Comedy | 104 | 104 | 71 | 33 | 52 |
| Bugs | Thriller | 82 | 76 | 58 | 18 | 26 |
| Walk the Line | Music | 136 | 118 | 70 | 48 | 35 |

Table 2.2: Performance comparison for scene extraction.

| Video title | Recall (our approach) | Precision (our approach) | Recall (Yeung's approach) | Precision (Yeung's approach) |
|---|---|---|---|---|
| Lgerca lisa 1 | 100% | 100% | 100% | 85.7% |
| Dungeons & Dragons | 81.8% | 90.0% | 74.2% | 81.6% |
| Little Voice | 78.0% | 82.1% | 72.3% | 72.3% |
| Hot Chick | 68.3% | 57.7% | 54.8% | 50.9% |
| Bugs | 76.3% | 69.0% | 59.2% | 64.3% |
| Walk the Line | 59.3% | 66.7% | 47.5% | 56.0% |

2.7 Concluding Remarks

In this chapter, we have proposed a mosaic-based algorithm for extracting scene structures from digital movies. Our approach is based on the idea that shots belonging to one particular scene often have similar backgrounds. Using a mosaic technique, the background of each video frame can be recovered. The color and texture features of each background image are integrated to compute shot similarities. Based on the movie-making model, our algorithm is able to group correlated shots into one scene. The computation is costly, but the spatiotemporal information of videos is fully exploited to achieve scene extraction. Experimental results show that the proposed approach works reasonably well in detecting most of the scene boundaries. Compared with some existing techniques [10, 14, 15], our approach is promising. Our approach can be applied directly to organize videos and can be used to provide browsing and retrieval facilities to users.


As a scene is a subjective concept that reflects human perception, our future work will focus on investigating an adaptive technique to perform user-oriented scene extraction. The proposed shot similarity measure does not use motion features to capture temporal variation in a video. Thus, another future research issue will be the integration of motion information into the proposed shot similarity measure for other tasks such as video retrieval.


PTZ Camera Network Reconfiguration

In this chapter, we propose a linear production game solution for a pan, tilt, and zoom (PTZ) camera network to reconfigure cameras’ parameters. First, we give an introduction about this research topic. The proposed approach is then presented. Next, simulations and a real world experiment are detailed. Finally, the conclusion is given.

3.1 Introduction

Intelligent video surveillance systems have been used for several years, and they are now widely deployed in important places, such as airports, all over the world. A single camera can provide useful information for event detection and target tracking [2, 3]; however, a surveillance system based on a camera network can reduce the number of blind spots and improve the system's reliability [40–46]. Camera networks usually comprise a heterogeneous collection of cameras, including panoramic cameras, fixed cameras, infrared cameras, and pan-tilt-zoom (PTZ) cameras. Among the different types of imaging devices, PTZ cameras are the most important components of an intelligent surveillance system because their field of view (FOV) can be changed in response to different task requirements. However, incorporating PTZ cameras into a surveillance system raises a challenging issue: How can the cameras be controlled and coordinated to accomplish a


given task? Most surveillance tasks performed by PTZ cameras are related to three functions: tracking multiple targets, improving evidential quality, and maximizing surveillance coverage.

Target tracking involves target detection as well as temporal and/or inter-view target correspondence matching. For example, Lim et al.'s approach [40] tracks the targets observed in the FOVs and constructs a dynamic scene model containing the position, velocity, and view-dependent visibility of each target. The system tries to accomplish three objectives, namely, initial detection of moving objects, tracking of moving objects, and scheduling of cameras to monitor activities. Cameras are assigned to tasks by solving a bipartite matching problem so that tasks are accomplished in order of priority. In the PTZ camera network developed by Ukita and Matsuyama [41], when the system detects a new target, nearby idle cameras are assigned to track it. The system is simple and effective provided that the number of cameras is greater than the number of targets. Qureshi and Terzopoulos [42] introduced a multi-camera tracking system in which calibrated wide-FOV cameras are used to locate targets, and PTZ cameras are used to fixate on the located targets. The PTZ network operates according to heuristic rules designed to track targets cooperatively. Their method was further improved so that it can be applied to an uncalibrated multi-camera surveillance system connected by a wireless communication network [47]. Since the wireless communication range is limited, Qureshi and Terzopoulos assumed that each camera can only communicate with its neighbors. Therefore, cameras within communication range can share information about the targets to be tracked. Camera nodes can be added or removed very easily with their method. However, the flexibility is exchanged for security because, in general, a camera network that fully shares information with a centralized server can outperform a camera network that shares information only locally.

In summary, a cooperative target tracking approach can utilize the camera resources efficiently and enable the cameras to support each other in recovering the tracking when a tracking task fails.


While a solution to the target tracking problem can provide each target's trajectory, a surveillance system usually requires more information. For example, it is often necessary to capture the face of a human target or the license plate of a vehicle at a high resolution. These applications are related to improving evidential quality [44, 45]. In addition, PTZ cameras are frequently used to extend the coverage of the surveillance area by pan-tilt scanning. Piciarelli et al. [46] proposed an approach that reconfigures the pan-tilt-zoom parameters of all PTZ cameras based on the probability of observing an event in a specific location. Song et al. [43] applied game theory to maximize the surveillance coverage in their decentralized system, and adopted a sequential optimization strategy to achieve the Nash equilibrium [48]. Under this method, one PTZ camera is selected at random each time and its parameters are tuned, while those of the other cameras remain unchanged. After the Nash equilibrium is achieved, the cameras should cover the entire surveillance area at an acceptable resolution. When a human operator decides to track a specific target at a higher resolution, the target is assigned to the most appropriate PTZ camera, which is then excluded from the game. As a result, the remaining cameras have to adjust their parameters and try to maintain the maximum surveillance coverage. Based on the result of [43], they introduced a distributed consensus algorithm to solve target tracking and activity recognition problems [49, 50]. They modified the utility function in the game theoretic control framework [43] and then applied the method to control the cameras to cover the entire surveillance area. They also proposed using a decentralized Kalman filter algorithm to track the position and velocity of each target. The game theoretic framework has two advantages: 1) it can be implemented easily; and 2) only a small amount of information needs to be exchanged.

However, we remark that the Nash equilibrium is not necessarily an optimal solution. Moreover, tracking a specific target at a higher resolution is treated as an exceptional task that cannot be optimized within the same game theory framework. Another potential problem of their method is that there is no mechanism to prevent too many resources (i.e., cameras) from being invested in tracking a single target.


Reconfiguring PTZ cameras to achieve any of the three objectives is intrinsically a combinatorial optimization problem. Computing the optimal solution is very time consuming and, therefore, existing methods usually reduce the problem to a bipartite matching problem, which assigns tasks to cameras. However, the task assignment formulation does not fully utilize the camera network. For example, when the number of tasks is greater than the number of cameras, some tasks have to be abandoned even though a camera may be able to accomplish multiple tasks at the same time.

In this chapter, we propose an optimal and flexible solution to the PTZ network coordination problem. We show that the problem can be formulated as a linear production game (LPG). The LPG concerns how a group of collaborative players utilize their limited resources to create various products that yield the maximum payoff, given that the price of each product is known. The players in the LPG of a PTZ network are the cameras. Each camera can control its FOV by selecting the PTZ parameters. Although the number of all PTZ combinations is very large, only a small set of new PTZ settings has to be considered because of the limited speed of the PTZ actuators. The new FOV corresponding to each new PTZ setting is the product of the game. The resources owned by each player (i.e., camera) are the targets that are observable to the camera. The observability of a target to a camera is determined by checking whether the camera can select a feasible PTZ setting to observe the target. The price of a product (i.e., the new FOV of a camera) is evaluated by examining the video quality of each target in the FOV. The goal of this cooperative game is to select a set of FOVs for the cameras to maximize the total payoff (the video quality of the targets). The LPG is a special case of a linear programming problem. While a linear programming problem may be infeasible or unbounded [51], the LPG always has an optimal solution that can be evaluated in polynomial time. Therefore, many techniques, such as branch-and-bound and cutting-plane techniques [52], can be applied to solve the camera network reconfiguration problem.


Table 3.1: Symbol table used in Section 3.2

| Symbol | Meaning |
|---|---|
| $n$ | number of cameras |
| $m$ | number of detected targets |
| $g^k_t$ | status vector of target $k$ at time $t$ |
| $b^k_t$ | 3D bounding box of target $k$ at time $t$ |
| $v^k_t$ | velocity of target $k$ at time $t$ |
| $U_t$ | the status of the $m$ targets at time $t$ |
| $\hat{U}_{t+1}$ | the predicted status of the $m$ targets at time $t + 1$ |
| $\phi^i_j$ | the $j$-th feasible FOV of the $i$-th camera |
| $\Phi^i$ | the feasible FOVs of the $i$-th camera |
| $w_i$ | number of feasible FOVs of the $i$-th camera |
| $Q(\cdot)$ | the quality function of an FOV combination |
| $Q_k(\cdot)$ | the quality function of observing the $k$-th target |

3.2 Problem Formulation

Suppose a surveillance system contains $n$ calibrated PTZ cameras, each of which is controlled by a network-connected processor. In addition, a fixed (non-PTZ) camera in our system is treated as a PTZ camera with only one available FOV. Furthermore, let $m$ be the number of targets detected in the surveillance area. Each detected target is represented by a status vector denoted by $g^k_t = (b^k_t, v^k_t)$, where $b^k_t$ and $v^k_t$ are, respectively, the 3-D bounding box and the velocity of target $k$ estimated at time $t$. The targets' status $U_t = \{ g^k_t \mid k = 1, 2, \ldots, m \}$ and the static background constitute a dynamic scene model (the targets' historical positions and a top-view scene model) that can be used to predict the status of all $m$ targets at time $t + 1$, expressed as $\hat{U}_{t+1} = \{ \hat{g}^k_{t+1} \mid k = 1, 2, \ldots, m \}$. The model is maintained by a central information processing node (a central server) that gathers information about the detected targets and the camera parameters from each camera node. It is assumed that the camera network has been calibrated and that the homography (a point-to-point mapping matrix) between any two of the cameras is known, so that the information about the detected targets can be integrated. The central information processing node is also responsible for determining the optimal camera parameters.


3.2.1 Parameters to be Determined

Let $\phi^i$ denote the $i$-th camera's FOV, which is controlled by the pan-tilt-zoom parameters of the camera. We assume that the relationship between the FOV and the parameters is known. Therefore, the problem of determining the optimal camera parameters is transformed into a problem of selecting the optimal FOV for each camera. Because of the limited speed of the lens motor, a camera can only change its parameters locally in a short time. Hence, given each camera's current parameters, a set of feasible FOVs can be constructed and expressed as follows:

$\Phi^i = \{ \phi^i_j \mid j = 1, 2, \ldots, w_i \}$, (3.1)

for $i = 1, 2, \ldots, n$, where $w_i$ is the number of feasible FOVs of the $i$-th camera. The PTZ camera coordination problem is formulated as the following combinatorial optimization problem:

$(\phi^{1*}, \ldots, \phi^{n*}) = \arg\max_{\phi^i \in \Phi^i,\ i = 1, \ldots, n} Q(\phi^1, \ldots, \phi^n)$, (3.2)

where $Q(\cdot) : \Phi^1 \times \cdots \times \Phi^n \mapsto \mathbb{R}$ is a function mapping $\phi^1, \ldots, \phi^n$ to a real quality value.
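To make the combinatorial nature of (3.2) concrete, the following Python sketch solves a toy instance by exhaustive search. The function names and the quality function `covered_targets` are hypothetical placeholders, and an FOV is modeled simply as the set of target ids it contains.

```python
from itertools import product

def best_combination(feasible, quality):
    """Return the FOV tuple that maximizes quality by exhaustive search, cf. (3.2)."""
    return max(product(*feasible), key=quality)

# Toy setup: two cameras with two feasible FOVs each.
feasible = [
    [frozenset({1}), frozenset({2})],   # feasible FOVs of the first camera
    [frozenset({1}), frozenset({3})],   # feasible FOVs of the second camera
]

def covered_targets(fovs):
    """Placeholder quality: number of distinct targets seen by the cameras."""
    return len(frozenset().union(*fovs))

best = best_combination(feasible, covered_targets)
print(best, covered_targets(best))
```

The search visits $w_1 \times \cdots \times w_n$ combinations, so it is practical only for very small networks.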

In the next subsection, we explain how to assess the quality of a set of FOVs for different goals.

3.2.2 Quality Function of A Camera’s FOV

Under the dynamic scene model, the locations of predicted targets can be computed for each camera’s FOV. The predicted bounding box is defined as a region of interest (ROI). In a visual surveillance system, assessing the quality of an FOV usually involves the following two steps.


should be evaluated as having the lowest quality.

2. Evaluate the dimensions (width and height) of each ROI. The resolution of the ROI should be sufficient to accomplish the given task. Aldrige and Gilbert [1] suggested different resolution requirements for different tasks. If the resolution is lower than the suggested value, a low quality value should be assigned to it. Conversely, if the resolution is higher than the suggested value, the quality value should be upper bounded or reduced to induce the camera to zoom out and monitor a larger area.

For most surveillance tasks, the quality of each camera's FOV can be evaluated individually, and the total quality function, $Q(\phi^1, \ldots, \phi^n)$, can be simplified as follows:

$Q(\phi^1, \ldots, \phi^n) = f(q_1, q_2, \ldots, q_n)$, (3.3)

where $q_i = Q(\phi^i)$ for $i = 1, 2, \ldots, n$, and $f(\cdot) : \mathbb{R}^n \mapsto \mathbb{R}$ is a function that maps the $n$ individual quality values to a total quality value. We discuss possible choices of $f(\cdot)$ later in this section. Furthermore, the quality function of each FOV, say $\phi^i$, can be expressed as a function of the qualities of individual ROIs. Since the quality of each ROI can be evaluated independently, it is reasonable to compute the quality of an FOV as follows:

$Q(\phi^i) = \sum_{k=1}^{m} Q_k(\phi^i)$, (3.4)

where $Q_k(\phi^i) \triangleq Q(\hat{b}^k_{t+1}; \phi^i, \hat{T}_{t+1} - \hat{g}^k_{t+1})$ is the non-negative quality of observing the $k$-th target with FOV $\phi^i$. The value is zero when $\hat{b}^k_{t+1}$ is not observable in FOV $\phi^i$ or when it is completely occluded by other targets at $\hat{T}_{t+1}$. The simplest form of $f(\cdot)$ in Equation (3.3) is a linear summation function given by

$Q_L(\phi^1, \ldots, \phi^n) = \sum_{i=1}^{n} q_i$. (3.5)

However, since the qualities of a target in all views count toward the total quality function, maximizing (3.5) makes all the cameras pursue high video quality targets and ignore low quality ones.

To resolve the problem, we adopt the following non-linear quality function in this work:

$Q_{NL}(\phi^1, \ldots, \phi^n) = \sum_{k=1}^{m} \max_i Q_k(\phi^i)$, (3.6)

where only the maximum ROI quality of each target counts toward the total quality. Thus, the quality of a solution that favors a specific target will be lower than that of a solution that assigns the cameras to monitor different targets.
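The difference between the linear summation and the non-linear function (3.6) can be seen on a toy example in Python. The quality values below are made up for illustration, and `q[i][k]` is assumed to hold $Q_k(\phi^i)$ for camera $i$ and target $k$.

```python
def q_linear(q):
    """Sum the qualities of every target in every camera view."""
    return sum(sum(row) for row in q)

def q_nonlinear(q):
    """Sum, over targets, the best quality among all camera views, cf. (3.6)."""
    num_targets = len(q[0])
    return sum(max(row[k] for row in q) for k in range(num_targets))

# Option A: both cameras watch target 0 at high quality; target 1 is ignored.
option_a = [[0.9, 0.0],
            [0.9, 0.0]]
# Option B: each camera watches a different target.
option_b = [[0.9, 0.0],
            [0.0, 0.6]]

for name, q in (("A", option_a), ("B", option_b)):
    print(name, q_linear(q), q_nonlinear(q))
```

The linear sum ranks option A higher because the duplicated view of target 0 is counted twice, whereas the non-linear function ranks option B higher because each target counts only once.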

3.3 The Proposed Approach

In this section, we show that the non-linear objective function (3.6) can be converted into a linear function by expanding the set of feasible solutions and imposing new constraints. Let $T^i_j$ and $|T^i_j|$ denote, respectively, the set of targets covered by FOV $\phi^i_j$, i.e., the $j$-th feasible FOV of the $i$-th camera (refer to Equation (3.1)), and the number of targets in $T^i_j$. The total number of subsets of $T^i_j$ is $2^{|T^i_j|}$, and $h$ is an index of the subset. For each subset $S^i_{j,h} \subset T^i_j$, $1 \le h \le 2^{|T^i_j|}$, we can construct a virtual FOV $\phi^i_{j,h}$ that ignores any target not in $S^i_{j,h}$, i.e., $Q_k(\phi^i_{j,h}) = 0$ for all $k \notin S^i_{j,h}$ (see Figure 3.1).

Notably, introducing virtual FOVs into the system increases the number of expanded feasible FOVs to $\sum_{i=1}^{n} \sum_{j=1}^{w_i} 2^{|T^i_j|} \le n\, w_{\max}\, 2^{|T_{\max}|}$, where $w_{\max}$ and $|T_{\max}|$ are the maximum number of feasible FOVs of a camera and the maximum number of targets in an FOV, respectively. From this complexity analysis, the computational load grows linearly with the number of cameras and exponentially with the number of targets in an FOV. The exponential growth in the number of variables may lead to a scalability problem. However, since a PTZ camera is mainly used to acquire high-definition images of targets by choosing a proper zoom setting, the number of targets observed by a PTZ


Figure 3.1: The decomposition of an FOV containing $|T^i_j|$ targets into $2^{|T^i_j|} - 1$ virtual FOVs.

camera is very limited. To give an impression of the typical number of targets covered by the FOV of a camera in different surveillance applications, Table 3.3 lists the suggested target sizes for the four applications described by Aldrige and Gilbert [1]. According to Table 3.3, the maximum number of targets in an FOV is at most 15 for the recognition and identification applications. In practice, the maximum number of targets will be much smaller than this value, and the solution can be computed very efficiently. The scalability problem emerges only when too few PTZ cameras are used to observe too many targets. In that case, the PTZ cameras operate in the wide-angle (low-resolution) mode, and the video content is less informative. To acquire useful surveillance videos, one should consider introducing more PTZ cameras into the network. Hence, although the proposed method might suffer from scalability issues from an algorithmic point of view, these issues can be ignored in practice.

By replacing $\phi^i_j$ with the virtual FOVs $\phi^i_{j,h}$, $h = 1, \ldots, 2^{|T^i_j|}$, the number of feasible FOVs to be assigned to the $i$-th camera becomes $\sum_{j=1}^{w_i} 2^{|T^i_j|}$


Table 3.2: Symbol table used in Section 3.3

| Symbol | Meaning |
|---|---|
| $T^i_j$ | targets covered by the FOV $\phi^i_j$ |
| $\lvert T^i_j \rvert$ | number of targets covered by the FOV $\phi^i_j$ |
| $S^i_{j,h}$ | subset of $T^i_j$ |
| $w_{\max}$ | the maximum number of feasible FOVs of a camera |
| $\lvert T_{\max} \rvert$ | the maximum number of targets in an FOV |
| $\phi^i_{j,h}$ | virtual FOV expanded from $\phi^i_j$ |
| $x^i_{j,h}$ | binary variable indicating whether $\phi^i_{j,h}$ is selected |
| $o_{ijhk}$ | binary coefficient indicating whether the $k$-th target is observable in $\phi^i_{j,h}$ |
| $q^*_k$ | the quality of the $k$-th target evaluated by one of the optimal FOVs |

Table 3.3: The required target sizes for different surveillance applications, as suggested in [1].

| Application | Suggested target height (%) in a CCTV FOV | Suggested target size in a 640×480 FOV (average human width-height ratio 0.3442 [53]) | Maximum suggested number of targets in an FOV |
|---|---|---|---|
| Monitor and control | > 5% | 9×24 | 1422 |
| Detection | > 10% | 17×48 | 376 |
| Recognition | > 50% | 83×240 | 15 |
| Identification | > 120% | 199×576 | 2 |
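The entries of Table 3.3 appear to follow from simple arithmetic, which the following Python sketch reproduces. The rounding rules (rounding the target width up and dividing the FOV area by the target area, rounded down) are assumptions inferred from the tabulated values, and the function names are hypothetical.

```python
import math

def target_size(height_percent, fov_h=480, aspect=0.3442):
    """Return (width, height) in pixels of a target of the given relative height."""
    h = math.ceil(fov_h * height_percent / 100)
    w = math.ceil(aspect * h)          # assumed: width is rounded up
    return w, h

def max_targets(height_percent, fov_w=640, fov_h=480, aspect=0.3442):
    """Assumed rule: FOV area divided by target area, rounded down."""
    w, h = target_size(height_percent, fov_h, aspect)
    return (fov_w * fov_h) // (w * h)

for name, pct in (("Monitor and control", 5), ("Detection", 10),
                  ("Recognition", 50), ("Identification", 120)):
    print(name, target_size(pct), max_targets(pct))
```

Under these assumptions the sketch yields the same sizes and counts as the table, including the identification case, where the suggested target height exceeds the height of the FOV.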


we define the binary variables $x^i_{j,h} \in \mathbb{B}$ ($\mathbb{B} \triangleq \{0, 1\}$) to indicate whether the $(j, h)$-th virtual FOV of the $i$-th camera, i.e., $\phi^i_{j,h}$, is selected. Therefore, the optimization problem in FOV selection can be rewritten as follows:

$\max_{x} \sum_{i=1}^{n} \sum_{j=1}^{w_i} x^i_{j,h} \sum_{k=1}^{m} Q_k(\phi^i_{j,h})$, (3.7)

subject to

$\sum_{j=1}^{w_i} \sum_{h=1}^{2^{|T^i_j|}} x^i_{j,h} \le 1$, (3.8)

for $i = 1, 2, \ldots, n$, and

$\sum_{i=1}^{n} \sum_{j=1}^{w_i} \sum_{h=1}^{2^{|T^i_j|}} x^i_{j,h}\, o_{ijhk} \le 1$, (3.9)

for $k = 1, 2, \ldots, m$, where $x = (x^1_{1,1}, x^1_{1,2}, \ldots, x^n_{w_n, 2^{|T^n_{w_n}|}})$ and the binary coefficient $o_{ijhk} \in \mathbb{B}$ indicates whether the $k$-th target is observable in the virtual FOV $\phi^i_{j,h}$. The constraint specified in (3.8) ensures that each camera can only be assigned one FOV at a time, and (3.9) guarantees that the quality of a target can only be evaluated by a single FOV, because selecting a target more than once violates the constraint in (3.9) (the summation value would be larger than 1).
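The selection problem (3.7) to (3.9) can be illustrated on a toy instance with the Python sketch below. It enumerates all choices exhaustively instead of using a linear programming solver, so it only illustrates the constraints and is not an efficient method. All names and quality values are hypothetical; an FOV is modeled as a dictionary mapping a target id $k$ to its quality $Q_k$.

```python
from itertools import combinations, product

def virtual_fovs(fov):
    """Yield each non-empty subset of an FOV's targets as a virtual FOV."""
    targets = sorted(fov)
    for size in range(1, len(targets) + 1):
        for subset in combinations(targets, size):
            yield {k: fov[k] for k in subset}

def select_fovs(cameras):
    """Return (best total quality, chosen virtual FOV per camera).

    cameras[i] is the list of feasible FOVs of camera i.
    """
    # The empty dict models a camera that is assigned no virtual FOV.
    options = [[{}] + [v for fov in cam for v in virtual_fovs(fov)]
               for cam in cameras]
    best_value, best_choice = -1.0, None
    for choice in product(*options):              # one option per camera, cf. (3.8)
        observed = [k for v in choice for k in v]
        if len(observed) != len(set(observed)):   # target counted twice, cf. (3.9)
            continue
        value = sum(sum(v.values()) for v in choice)   # objective, cf. (3.7)
        if value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice

cameras = [
    [{0: 0.9, 1: 0.4}],          # first camera: one FOV seeing targets 0 and 1
    [{0: 0.8}, {1: 0.7}],        # second camera: two FOVs, one target each
]
value, choice = select_fovs(cameras)
print(value, choice)
```

In this instance the first camera is assigned the virtual FOV that keeps only target 0, and the second camera observes target 1, so each target is counted once at its best available quality.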

The relation between the solutions to (3.6) and (3.7) can be derived by changing the summation order of (3.7) as follows:

$\sum_{k=1}^{m} \left[ \max_{x} \sum_{i=1}^{n} \sum_{j=1}^{w_i} x^i_{j,h}\, Q_k(\phi^i_{j,h}) \right] = \sum_{k=1}^{m} q^*_k$, (3.10)

where $q^*_k$ is the quality of the $k$-th target evaluated by one of the optimal FOVs.

The objective function and the constraints given in Equations (3.10), (3.8), and (3.9) form a linear programming problem. Since a linear programming problem may be infeasible or unbounded [51], it is important to show that the above formulation has an optimal solution that can be computed efficiently. In the following subsection, we show that