Introduction - Research and Application of Bayesian Hierarchical Structures in Video Surveillance (貝氏階層式結構於視訊監控之研究與應用)

In recent years, many vision-based techniques have been investigated to enhance the intelligent functions of modern surveillance systems. Among these techniques, object detection and labeling are especially crucial. For a single-camera system, these two processes are the fundamental steps for advanced analyses such as object tracking and behavior understanding. To date, many frameworks have been used to detect and label targets of interest. For example, Schneiderman and Kanade [68] proposed a trainable object detector for faces and cars based on the statistics of localized parts. The AdaBoost-based detection algorithm [69] is another widely used technique for detecting specific objects in 2-D images. However, since a 2-D image lacks 3-D depth information, target detection usually suffers from the occlusion problem, especially when multiple targets appear in a complicated scene.

An alternative way to deal with the occlusion problem is to use a multi-camera system. The cross reference of multiple camera views can effectively handle the occlusion problem and provide a reliable way for object labeling and correspondence.

To date, several multi-camera surveillance systems have been proposed for multi-target correspondence. These approaches can be roughly classified into two major categories: “direct correspondence” and “indirect correspondence”. In a “direct correspondence” approach, moving objects are first detected in each 2-D camera view. Object correspondences are then built among the 2-D camera views, and the 2-D detection results from different views are fused to support surveillance over the 3-D space. For instance, in [83], Khan et al. found the overlapping fields of view among cameras. Whenever a moving object enters an overlapping region, the correspondence between this object and its counterparts in other camera views can be established. In [84], Hu et al. proposed a principal-axis-based correspondence among multiple camera views. This method offers robust results and can tolerate a certain level of defects in the motion detection and segmentation of each camera view. Moreover, the typically required camera calibration step is not necessary in their system. In [85], Black and Ellis established the correspondence by comparing the distance between the projected epipolar lines and the detected objects in each 2-D image. For a multi-camera system with a narrow-baseline setup, the epipolar constraint provides an efficient way to establish the correspondence.
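The epipolar-distance test mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the fundamental matrix `F`, the function names, and the sample pixel coordinates are hypothetical and are not taken from [85]. The sketch maps a detection in view 1 to its epipolar line in view 2 and picks the view-2 detection closest to that line.

```python
import math

def epipolar_line(F, x):
    """Return the epipolar line (a, b, c) in view 2 for pixel x = (u, v) in view 1.

    F is an assumed 3x3 fundamental matrix given as nested lists; the line is F @ [u, v, 1].
    """
    p = (x[0], x[1], 1.0)
    return tuple(sum(F[r][c] * p[c] for c in range(3)) for r in range(3))

def point_line_distance(line, x):
    """Perpendicular distance from pixel x = (u, v) to the line a*u + b*v + c = 0."""
    a, b, c = line
    return abs(a * x[0] + b * x[1] + c) / math.hypot(a, b)

def match_by_epipolar_distance(F, point_view1, candidates_view2):
    """Return (index, distance) of the view-2 candidate nearest the epipolar line."""
    line = epipolar_line(F, point_view1)
    dists = [point_line_distance(line, c) for c in candidates_view2]
    best = min(range(len(dists)), key=lambda i: dists[i])
    return best, dists[best]

# Hypothetical rectified pair: epipolar lines are image rows (y' = y).
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]
best_index, best_dist = match_by_epipolar_distance(
    F_RECTIFIED, (100.0, 50.0), [(120.0, 80.0), (90.0, 52.0), (300.0, 10.0)]
)
```

In practice a distance threshold would be needed to reject detections that lie far from every epipolar line. The nearest-candidate rule here is a simplification.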

Basically, most “direct correspondence” approaches require that the foreground regions of each target be correctly extracted in each camera view to ensure reliable correspondence. However, in the presence of occlusion, this requirement cannot be

by matching the color appearance of segmented regions along epipolar lines in pairs of camera views. In their approach, the mid-points of the matched regions are projected into the 3-D space to yield a 3-D probability distribution map that describes object position. Although this method may relax the need for accurate foreground extraction, it additionally requires color calibration among multiple cameras. Incorrect correspondence may also occur when matching objects with similar color appearance.

In the “indirect correspondence” category, a multi-camera system fuses multi-view information onto a pre-selected data-fusion space. The fused information is then projected back to each camera view to build object correspondence. Typically, the 3-D space is chosen as the space for data fusion. For example, Utsumi et al. [88] proposed the use of intersection points, which are the intersections of the 3-D lines emitted from the 2-D tracking results of different camera views. In that approach, a mixture of Gaussian functions was used to describe the possible positions of moving objects in the 3-D space. By projecting these 3-D Gaussian distributions back onto each 2-D image plane, the object correspondence among camera views is derived in a probabilistic manner. On the other hand, Fleuret et al. [89][90][91] adopted a simple blob detector for 2-D analysis and introduced a generative model to fuse data from multiple views. In their system, a discrete occupancy map is designed to describe whether an individual target is standing at a specific ground location in the 3-D space. The most likely trajectory of each individual over the 3-D ground plane is then traced via the Viterbi algorithm. In [92][93], Huang and Wang proposed a model-based approach to efficiently fuse consistent 2-D foreground detection results from multiple camera views. A probabilistic method is further proposed to simultaneously label and map multiple targets based on a Markov network.
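The Viterbi step mentioned above can be illustrated on a toy problem. The sketch assumes a 1-D row of ground cells, per-frame occupancy probabilities that some detector has already produced, and a motion rule that lets a target stay put or move at most `max_step` cells per frame, with all allowed moves treated as equally likely. These are illustrative assumptions, not the generative model of [89][90][91].

```python
import math

def viterbi_track(occupancy, max_step=1):
    """Return the most likely cell index per frame.

    occupancy[t][k] is the (assumed, precomputed) probability that the target
    occupies ground cell k at frame t. Moves larger than max_step are forbidden.
    """
    n_cells = len(occupancy[0])

    def log(p):
        return math.log(max(p, 1e-12))  # avoid log(0)

    score = [log(p) for p in occupancy[0]]
    backpointers = []
    for frame in occupancy[1:]:
        new_score, pointers = [], []
        for k in range(n_cells):
            lo, hi = max(0, k - max_step), min(n_cells, k + max_step + 1)
            # Best reachable predecessor cell for cell k.
            prev = max(range(lo, hi), key=lambda j: score[j])
            new_score.append(score[prev] + log(frame[k]))
            pointers.append(prev)
        score = new_score
        backpointers.append(pointers)

    # Backtrack from the best final cell.
    k = max(range(n_cells), key=lambda j: score[j])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return path[::-1]

# A target walking across three cells.
walk = viterbi_track([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8]])

# A spurious far-away peak in the middle frame is rejected by the motion rule.
steady = viterbi_track([[0.7, 0.1, 0.1, 0.1],
                        [0.2, 0.05, 0.05, 0.7],
                        [0.7, 0.1, 0.1, 0.1]])
```

The second call shows why a temporal model helps: the strongest response in the middle frame is at cell 3, but no admissible trajectory passing through it scores as well as staying at cell 0.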

Instead of fusing multi-view information onto the 3-D space, Khan and Shah [94] chose one of the 2-D camera views as the reference view for data fusion. In their approach, without relying on complicated camera calibration, they built a few homography matrices to map the projected ground planes in multiple camera views. They then fused the foreground likelihood information from multiple views onto the scene plane in the reference camera view to generate a probability map of the target location. Owing to geometric consistency, the fused target-location probability map, named the “synergy map” in [94], indicates a higher probability at a true target location. The synergy map is finally rectified so that the target location in the reference image is remapped to the corresponding ground-plane location in the 3-D space. Since the fused synergy map is built over a 2-D image space, the spatial resolution of the target location is influenced by the perspective projection and is non-uniform in the 3-D space. A target far away from the reference camera has a lower location resolution, while a target close to the reference camera has a higher resolution. In addition, it is somewhat complicated to incorporate prior knowledge of the 3-D targets into this 2-D fusion framework.
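The homography-based fusion described above can be sketched as follows. The sketch assumes that a ground-plane homography from the reference view to one other view is already known, that each view provides a per-pixel foreground likelihood image, and that likelihoods are fused by a simple product with nearest-neighbour sampling. The function names and the product rule are assumptions for illustration and are not claimed to reproduce the synergy map of [94].

```python
def apply_homography(H, point):
    """Map pixel (x, y) through a 3x3 homography H given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]  # assumed non-zero for valid pixels
    return xh / w, yh / w

def fuse_on_reference(ref_likelihood, other_likelihood, H_ref_to_other):
    """Fuse two foreground likelihood images on the reference view.

    Each reference pixel is mapped into the other view; the fused value is the
    product of the two likelihoods, or 0.0 if the mapped pixel falls outside.
    """
    rows, cols = len(ref_likelihood), len(ref_likelihood[0])
    o_rows, o_cols = len(other_likelihood), len(other_likelihood[0])
    fused = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            x, y = apply_homography(H_ref_to_other, (c, r))
            oc, orow = int(round(x)), int(round(y))
            if 0 <= orow < o_rows and 0 <= oc < o_cols:
                fused[r][c] = ref_likelihood[r][c] * other_likelihood[orow][oc]
    return fused

# Hypothetical one-row images; the other view is shifted right by one pixel.
H_SHIFT = [[1.0, 0.0, 1.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
fused_map = fuse_on_reference([[0.1, 0.9, 0.1]], [[0.1, 0.1, 0.9]], H_SHIFT)
```

Only the pixel where both views agree keeps a high fused value, which is the geometric-consistency effect the text describes. Because the grid is the reference image, equal pixel steps correspond to unequal ground distances under perspective, which is the source of the non-uniform resolution noted above.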

For the aforementioned “indirect correspondence” approaches, certain geometric ambiguities may cause “ghost objects” in the 3-D space. The ghost effect is another form of the inter-occlusion problem and is a classic problem in 3-D object reconstruction. Owing to the limited number of cameras around the surveillance zone, some ghost objects may occasionally satisfy the geometric consistency and appear in the reconstructed 3-D scene. These fake targets can severely degrade the accuracy of object correspondence. In recent years, several approaches have been proposed to suppress ghost objects in multi-camera applications. Including the aforementioned method in [94], most methods used the temporal consistency to

Otsuka and Mukawa [95] proposed a framework of multi-view occlusion analysis to track objects. Once occlusion patterns are detected, occlusion hypotheses are launched to represent the uncertainty caused by occlusion. Since an occlusion structure usually lasts only for a short period, those hypotheses are tested recursively based on temporal consistency to suppress false detections. In [96], on the other hand, Guan et al. suppressed ghost targets by considering the consistency of color appearance. By projecting 3-D objects onto different image views, they identify ghost objects based on color dissimilarity. Moreover, their approach can automatically learn the appearance models for different objects in different camera views during the tracking process. This eliminates the requirement of color calibration among different cameras.
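The geometric ambiguity behind ghost objects can be made concrete with a small top-down example. The sketch assumes a 2-D ground plane, two cameras at known positions, and two targets seen by each camera as bearing rays. The camera positions, target positions, and function names are hypothetical. Intersecting every ray of one camera with every ray of the other yields four geometrically consistent candidates, of which only two are real.

```python
def ray_intersection(o1, d1, o2, d2):
    """Intersect 2-D rays o1 + s*d1 and o2 + t*d2 (s, t > 0); return point or None."""
    det = d2[0] * d1[1] - d1[0] * d2[1]
    if abs(det) < 1e-12:
        return None  # parallel rays
    bx, by = o2[0] - o1[0], o2[1] - o1[1]
    s = (d2[0] * by - bx * d2[1]) / det
    t = (d1[0] * by - d1[1] * bx) / det
    if s <= 0 or t <= 0:
        return None  # intersection lies behind a camera
    return (o1[0] + s * d1[0], o1[1] + s * d1[1])

def candidate_positions(cam_a, dirs_a, cam_b, dirs_b):
    """All pairwise ray intersections: true targets plus any ghosts."""
    points = []
    for da in dirs_a:
        for db in dirs_b:
            p = ray_intersection(cam_a, da, cam_b, db)
            if p is not None:
                points.append(p)
    return points

# Hypothetical layout: cameras at (0, 0) and (4, 0), targets at (1, 2) and (3, 2).
CAM_A, CAM_B = (0.0, 0.0), (4.0, 0.0)
TARGETS = [(1.0, 2.0), (3.0, 2.0)]
dirs_a = [(tx - CAM_A[0], ty - CAM_A[1]) for tx, ty in TARGETS]
dirs_b = [(tx - CAM_B[0], ty - CAM_B[1]) for tx, ty in TARGETS]

candidates = candidate_positions(CAM_A, dirs_a, CAM_B, dirs_b)
rounded = {(round(x, 6), round(y, 6)) for x, y in candidates}
ghosts = rounded - set(TARGETS)
```

With only two views, geometry alone cannot separate the two ghosts from the two real targets. Additional cues such as temporal consistency or color appearance, as in the methods discussed above, or further camera views, are needed to reject them.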

In this dissertation, we propose a new approach to efficiently integrate, summarize, and infer video messages from multiple client cameras. Even though we use only a simple foreground object detector, which yields imperfect foreground detection results, our system can still efficiently determine the number of moving targets inside the surveillance zone and accurately track their 3-D trajectories. In addition, our approach can perform pixel-level image labeling and match targets among multiple camera views.

The rest of this chapter is organized as follows. In Section 5.2, we present the main idea of the proposed framework, which is composed of a data fusion stage and an inference stage for multi-target labeling and correspondence. In Sections 5.3 and 5.4, we explain the details of the fusion stage and the inference stage, respectively. Experimental results and discussions are presented in Section 5.5.
