Recently, computer vision technology for video surveillance applications has made tremendous progress. Using an intelligent surveillance system to manage parking lots or to monitor security zones is becoming practical. To add more values to existing surveillance systems, various kinds of vision-based intelligent functionalities have been explosively proposed. For example, some algorithms provide user-friendly ways to help operators in the control room to monitor tens of, or even hundreds of, cameras; while a few others provide the capability to automatically detect unusual events in the surveillance zone. These vision-based algorithms may be roughly classified into single-camera based methods and multi-camera based methods. Among those methods, object detection and object labeling are two essential processes for subsequent analyses, like behavior modeling and scene modeling. Object detection, such as face detection and vehicle detection, is an object-level classification that tells
whether and where a specific object is inside an image. On the other hand, object labeling is an identity-level (ID-level) classification that determines the identity of each object region in the image. An example of human detection and human identity labeling is shown in Fig. 1. Even though it seems very easy and straightforward for human eyes to perform object detection and labeling, a robust computational algorithm for these two operations is actually not trivial at all.
(a) (b) (c) F
ig. 1. An example of human detection and human identity labeling. (a) Test image. (b) Humandetection result. (c) Human labeling result, with different colors indicating different persons.
For a single camera system, the captured 2-D image lacks the depth information and the detection of moving targets usually suffers from the occlusion problem, which makes it difficult to correctly label or segment connective targets. To deal with occlusion, some methods adopt multi-camera approaches. Even though the cross reference of multiple camera views may ease the occlusion problem and provide a more reliable way for object detection and labeling, the object correspondence among multiple cameras may become another thorny problem.
On the other hand, to detect foreground objects, the appearance ambiguity between the foreground objects and the surrounding background is a challenging issue that may fail many widely-used object detection algorithms. For example, some background subtraction algorithms, like [1][2], focus mainly on the modeling of background information. These algorithms work pretty well for scenes with stationary
background. However, they may detect incomplete foreground regions while the appearance of foreground objects happens to be similar to that of the background. To overcome this appearance ambiguity problem, simply relying on pixel-level image data would not be enough. Some other information, such as region-level messages and object-level messages, should be taken into consideration.
Besides occlusion and appearance ambiguity, the perspective distortion in 2-D images is also a challenging issue. An object far away from the camera and an object close to the camera would have quite different scales and shapes in the camera views.
To overcome the perspective effect, some researches focused on invariant feature descriptors. In their approaches, they detect reliable feature points first and design appropriate feature descriptors for object classification. For example, difference of Gaussian (DoG) [3] and Harris-Laplace [4] operators are popular feature extraction operators. The SIFT (Scale Invariant Feature Transform) [5] descriptor is another widely-used operator that is invariant to illumination variation and affine transformation. Even though these operators perform quite well in detecting prominent features, they are still incapable of handling object labeling in complicated scenes.
Shadow effect and lighting variations are another two troublesome issues that degrade the robustness of present surveillance systems. Plentiful works have been proposed to solve these two problems. For example, Finlayson et al. [6] proposed an entropy minimization method to extract from an image the intrinsic image that is shadow-free. Matsushita et al. [7] proposed an illumination normalization method based on an off-line learned eigenspace to eliminate shadows. On the other hand, a few methods have been proposed to maintain reliable color appearance under varying illumination conditions. A review of these color constancy algorithms could be found in [8]. Moreover, in the last decade, the Bayesian approach and some learning-based
methods for color constancy have gotten great attention. A complete survey of Bayesian color constancy methods could be found in [9].
To overcome these aforementioned problems, like occlusion, appearance ambiguity, perspective effect, shadow effect, and lighting variations, we found most existing methods rely more on image observation but less on 3-D scene knowledge. In this dissertation, we focus on the inclusion of 3-D scene knowledge in object detection and object labeling. In our study, we found the usage of 3-D knowledge could be very helpful in handling these complicated issues. Moreover, from the aspect of system functionality, an important role of a practical surveillance system is to dynamically reveal the 3-D status of the surveillance zone. To achieve this functionality, a major task of an intelligent surveillance system would be to automatically infer the unknown 3-D status based on the observed images. In this dissertation, we propose a Bayesian hierarchical framework to realize the integration of 2-D image information and 3-D scene model in a unified and efficient manner for scene inference. The optimal inference of BHF provides a systematic way to resolve the image labeling problem and to find out the 3-D scene unknowns simultaneously.
We also apply the framework to two real applications of video surveillance. By using the hierarchical framework to represent the image generation model in a probabilistic manner, our systems can systematically integrate useful information from pixel level, region level, and object level to achieve semantic inference of the 3-D environments.