
In the literature, commonly used methods for object detection, object labeling, and scene modeling can be roughly divided into three categories: data-driven methods, model-driven methods, and hybrid methods. In general, data-driven methods directly use region-level and pixel-level information from the image data to support image analysis and the inference of the 3-D scene, while model-driven methods use a few object-based models pre-learned from training data to infer the scene statuses and to detect objects of interest. Hybrid methods, in turn, combine both image information and object knowledge for image analysis.

In this dissertation, the proposed BHF framework is a hybrid method. As shown in Fig. 14, the message stream propagated upward from the observation layer is regarded as data-driven information, while the message stream propagated downward from the scene layer is regarded as model-driven knowledge. The BHF framework therefore has properties quite different from those of purely data-driven or purely model-driven methods. Compared with existing hybrid methods, it offers a new way to integrate pixel-level, region-level, and object-level information under a unified framework. A few distinctive properties of the proposed BHF framework are explained as follows.

3.2.1 Differences to Data-driven and Model-driven Methods

Compared with data-driven methods and model-driven methods, a distinctive feature of the proposed BHF is the integration of object-level information from the 3-D scene, region-level constraints in 2-D image patches, and pixel-level features from image pixels in a unified framework, as presented in Fig. 14. BHF has two main characteristics: (a) a unified framework that combines pixel-level, region-level, and object-level information to represent the generation process from 3-D scene to 2-D image; and (b) a systematic procedure to simultaneously analyze 2-D images and infer 3-D scene statuses.

For most bottom-up methods, the process usually begins with the classification of each pixel as a target pixel or a non-target pixel. Since the pixels of a target usually share a similar appearance, these methods merge target pixels into target regions based on region-level information in the image. However, when the appearance of a target region happens to be similar to that of the background, the appearance ambiguity causes the extracted target regions to be fragmented and incomplete. If the incomplete target regions are used to infer the 3-D scene statuses, the system accuracy deteriorates. Without object-level information, data-driven methods usually suffer from poor accuracy in object detection and labeling.

On the other hand, for most top-down methods, the process usually begins with the training of a suitable object-based classifier. Once the classifier has been learned, the process detects targets of interest by classifying image patches. These object-based detection methods can obtain a complete detection result without fragments, but may lose the accurate silhouette of the targets.

Furthermore, when there are multiple targets inside the 3-D scene, the occlusions among targets could be crucial and may cause difficulty in object detection and labeling.

In contrast, the proposed BHF framework consists of a scene layer for object-based information, an image observation layer for pixel-based information, and a labeling layer in the middle. This framework efficiently integrates top-down information with bottom-up messages. Based on this integration, top-down information and bottom-up messages cross-reference each other to support more robust and accurate inference. Moreover, the scene layer can systematically model the interaction among multiple targets, so that the proposed framework can effectively deal with inter-target occlusion during inference. This further improves the system performance.
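To make the idea of cross-referencing top-down and bottom-up messages concrete, the following is a minimal sketch for a single pixel with two labels (target and background). It is not the BHF inference procedure itself; it only applies Bayes' rule to a bottom-up appearance likelihood and a top-down prior. The function name `fuse_label_posterior` and its parameters are assumptions introduced for illustration.

```python
def fuse_label_posterior(likelihood_target, likelihood_background, prior_target):
    """Posterior probability that a pixel belongs to a target.

    likelihood_target / likelihood_background: bottom-up appearance
        likelihoods of the observed pixel under each label (hypothetical inputs).
    prior_target: top-down prior that the pixel is a target, as would be
        supplied by object-level knowledge (hypothetical input).
    """
    numerator = likelihood_target * prior_target
    denominator = numerator + likelihood_background * (1.0 - prior_target)
    # If both hypotheses have zero support, fall back to an uninformative value.
    return numerator / denominator if denominator > 0 else 0.5


# Ambiguous appearance: the pixel looks equally like target and background,
# so the top-down prior decides the label.
ambiguous = fuse_label_posterior(0.5, 0.5, prior_target=0.9)

# Uninformative prior: the bottom-up appearance decides the label.
appearance_driven = fuse_label_posterior(0.8, 0.2, prior_target=0.5)
```

In the first call the appearance is uninformative and the posterior follows the prior; in the second the prior is uninformative and the posterior follows the appearance. This mirrors, in a very reduced form, how the two message streams can compensate for each other.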

3.2.2 Differences to Existing Hybrid Methods

In recent years, a few hybrid frameworks that combine data-driven messages and model-driven information have been proposed to improve the performance of image labeling and object detection. In [60], the authors integrated image contexts and local appearance into a hybrid framework to provide improved image labeling results. However, the detection problem is not addressed in their method. In [61], a hierarchical conditional random field framework was proposed to model the interaction between image labeling and object detection. In this approach, the interaction is described based on the scene-context relationship. However, the adopted segmentation process is mainly based on local features and does not take the global shape layout constraints into account. In [62], a located-hidden-random-field framework was proposed to label and detect objects simultaneously. This method mainly focuses on the detection of a single object and adopts an object labeling template that is treated as the global shape knowledge for object detection. Extra effort is needed to identify the absence of objects or the presence of multiple objects. In [63], an extension of the located-hidden-random-field framework, named the layout-consistent random field framework, was proposed to further deal with inter-object occlusion. In this method, inter-object occlusions are assumed to be unexpected and are handled by defining asymmetric pair-level potentials between adjacent labels.

Even though the aforementioned methods also integrate pixel-level, region-level, and object-level information for image content analysis, there are distinctive differences between our BHF-based modeling and theirs. In our approach, we couple the object-level information with the 3-D scene inference based on a unified parametric scene model. In the proposed BHF framework, since the camera parameters have been calibrated beforehand, we can fully utilize the geometric knowledge of the monitored scene. Unlike previous methods, which learn the object-level information from a large set of training data, our BHF framework adopts the 3-D parametric scene model to synthesize geometric patterns for model learning. In other words, we do not rely solely on training data for the learning of the object models. Moreover, in BHF modeling, the use of the parametric scene model greatly reduces the dimension of the solution space. Since the possible status of each 3-D scene parameter is usually limited and can be quantized into a few choices, the possible solutions of image content labeling are well bounded.
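The effect of quantizing the scene parameters can be illustrated with a small enumeration. The sketch below assumes a hypothetical quantization (up to two targets, a 4 x 3 grid of ground locations, two sizes); none of these values come from the dissertation. It only shows that the number of scene hypotheses stays small compared with the number of unconstrained pixel-wise labelings.

```python
from itertools import product

# Hypothetical quantization of the scene parameters (illustrative values only).
TARGET_COUNTS = [0, 1, 2]      # possible numbers of targets in the scene
GRID_X = range(4)              # quantized ground-plane x positions
GRID_Y = range(3)              # quantized ground-plane y positions
SIZES = ["small", "large"]     # quantized target sizes


def enumerate_scene_states():
    """List every scene hypothesis as a tuple of (x, y, size) target states."""
    per_target_states = list(product(GRID_X, GRID_Y, SIZES))  # 4 * 3 * 2 = 24
    states = []
    for count in TARGET_COUNTS:
        # Targets are treated as distinguishable, so order matters here.
        states.extend(product(per_target_states, repeat=count))
    return states


scene_states = enumerate_scene_states()
num_scene_states = len(scene_states)       # 1 + 24 + 24**2 = 601

# For comparison: unconstrained binary labelings of a tiny 8 x 6 image.
num_free_labelings = 2 ** (8 * 6)
```

Even for this tiny image, the 601 scene hypotheses are many orders of magnitude fewer than the 2^48 free binary labelings, which is the sense in which a parametric scene model bounds the labeling solutions.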

Furthermore, since the 3-D scene is properly modeled, the occlusion effect, the perspective effect, and the shadow effect can be theoretically analyzed. To deal with the variations of the surrounding illumination and to integrate the geometric scene knowledge with image observation, a hidden labeling layer is included in the structure. With this hidden layer between the observation layer and the scene layer, our framework provides a systematic structure that is well suited to handling luminance variations, shadow effects, perspective effects, and occlusion.

In BHF, image labeling is modeled as a pixel-level classification process. By dynamically training the pixel-level classification models to adapt to luminance

labels. On the other hand, to handle the occlusion and shadow effect, the target number, target location, target size, and a few necessary scene factors are modeled as scene parameters. During the inference process, the statuses of those scene parameters are all inferred at the same time so that the occlusion effect and the shadow effect can be well handled.

Furthermore, for occlusions and shadows, the BHF framework can explicitly model their generation processes from 3-D scene to 2-D images. This makes occlusions and shadows a portion of the global knowledge. Hence, another distinctive feature of BHF is that occlusion and shadow effects can actually offer useful and structured information to support scene inference. The occlusion effect tells how the 3-D objects in the scene interact with each other, while the shadow effect conveys the existence of certain objects. In BHF, these two effects are modeled as parts of the global knowledge. Once the scene parameters in the scene layer are specified, this global knowledge can be used to deduce the expected labeling configuration. Under the BHF framework, the scene modeling and image labeling processes are linked in an interactive manner. The labeling of image pixels adopts global knowledge from the scene layer, while the scene layer makes a global inference based on local messages passed from the labeling process.
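As a rough illustration of how specified scene parameters can yield an expected labeling configuration with occlusion, consider the sketch below. It is a simplification under stated assumptions: each target is represented by an already-projected image rectangle and a depth value, whereas the BHF framework derives the projection from a calibrated 3-D parametric scene model. The function `expected_labeling` and the dictionary keys are hypothetical names.

```python
def expected_labeling(width, height, targets):
    """Synthesize a label map from a list of target hypotheses.

    Each target is a dict with keys 'id', 'x0', 'y0', 'x1', 'y1' (image
    rectangle, end-exclusive) and 'depth' (smaller means nearer to the camera).
    Label 0 denotes background.
    """
    labels = [[0] * width for _ in range(height)]
    # Paint from far to near so that nearer targets overwrite (occlude)
    # farther ones in the overlapping pixels.
    for target in sorted(targets, key=lambda t: t["depth"], reverse=True):
        for y in range(max(0, target["y0"]), min(height, target["y1"])):
            for x in range(max(0, target["x0"]), min(width, target["x1"])):
                labels[y][x] = target["id"]
    return labels


# Two overlapping targets: target 2 is nearer and occludes part of target 1.
scene = [
    {"id": 1, "x0": 0, "y0": 0, "x1": 4, "y1": 3, "depth": 5.0},
    {"id": 2, "x0": 2, "y0": 1, "x1": 6, "y1": 3, "depth": 2.0},
]
label_map = expected_labeling(6, 3, scene)
# label_map is:
# [1, 1, 1, 1, 0, 0]
# [1, 1, 2, 2, 2, 2]
# [1, 1, 2, 2, 2, 2]
```

The overlap region carries the label of the nearer target, so the occlusion pattern itself becomes part of the expected labeling that can be compared against the bottom-up evidence.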