For different video surveillance systems, the system unknowns, the available physical constraints, and the available observations are application-dependent. In BHF, we treat the system observations and unknowns as random variables and represent them as nodes in the BHF structure. Through a learning procedure, we train appropriate probability models for the physical constraints, which are the links in the BHF structure. With the integration of system unknowns, system observations, bottom-up constraints, and top-down constraints under the hierarchical framework, the analysis of image contents and the inference of the scene statuses are formulated as an optimization problem. By finding the optimal inference, the system can reach a semantic understanding of the monitored scene.
In BHF, the inter-layer links and intra-layer links represent the message propagations that should be properly modeled. As illustrated in Fig. 14, observation nodes are assumed to be conditionally independent when the statuses of the labeling layer are given. This implies that there are no connections among observation nodes. On the other hand, each labeling node represents a local decision based on a local observation.
Hence, there is a link connecting each labeling node and its corresponding observation node. Moreover, the local decisions of two adjacent labeling nodes are usually highly correlated. This property is modeled by connecting the labeling nodes as a four-neighbor Markov random field (MRF) [18]. To model the interactions between the labeling layer and the scene layer, each scene node that represents one kind of 3-D scene status is connected to related labeling nodes. Through those connections, the global information of geometric arrangement may influence the classification of local labeling nodes. In BHF, the topology of the inter-layer connection is flexible and application-oriented. In Chapter 4 and Chapter 5, we will apply the BHF framework to two different applications, a parking space detection system and a multi-camera surveillance system, to demonstrate how to define the nodes and how to model the links of the BHF structure in real applications.
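As a minimal sketch of the labeling-layer topology described above, the following Python snippet enumerates the links of a four-neighbor MRF on a small grid of labeling nodes. The function name `four_neighbor_edges` and the grid size are illustrative assumptions and are not part of BHF itself.

```python
from typing import Iterator, Tuple

Node = Tuple[int, int]  # (m, n) position of a labeling node


def four_neighbor_edges(rows: int, cols: int) -> Iterator[Tuple[Node, Node]]:
    """Yield each undirected link between 4-adjacent labeling nodes once."""
    for m in range(rows):
        for n in range(cols):
            if m + 1 < rows:   # link to the node directly below
                yield (m, n), (m + 1, n)
            if n + 1 < cols:   # link to the node directly to the right
                yield (m, n), (m, n + 1)


# Hypothetical 3 x 4 labeling layer.
edges = list(four_neighbor_edges(3, 4))
print(len(edges), "intra-layer links")
```

Each labeling node would additionally carry one link to its own observation node and, depending on the application, links to the related scene nodes.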
In principle, we can formulate the scene inference problem as a status decision process based on image observations. Since the process of image content analysis and the inference of the scene status are highly correlated, the proposed BHF is developed to combine the image labeling problem $p(H_L|I_L)$ and the scene inference problem $p(S_L|I_L)$. We define the system goal as simultaneously finding the optimal image content labeling and the 3-D scene parameters based on the image observation and some model constraints. By unifying these two problems under a single framework, the connections among pixel-level features, region-level constraints, and object-level knowledge are well-constituted in a hierarchical form. This structure enables the proper use of the information embedded among layers and provides an efficient way to deal with scene inference and image content analysis simultaneously rather than solving them individually. To find a suitable classification label $H_L$ and the best scene inference $S_L$ under the given observation $I_L$, an MAP optimization problem is defined as
$$(H_L^*, S_L^*) = \arg\max_{H_L, S_L} p(H_L, S_L \mid I_L) = \arg\max_{H_L, S_L} p(I_L \mid H_L, S_L)\, p(H_L \mid S_L)\, p(S_L), \tag{10}$$
where $H_L^*$ and $S_L^*$ denote the optimal labeling and the optimal scene parameters. Here, $p(S_L)$ represents the prior knowledge of the 3-D scene status and
$p(H_L|S_L)$ stands for the object-level constraints propagated from the 3-D parametric scene model to the labeling layer. In the graphical structure of our BHF, we use the links between the scene layer and the labeling layer to represent $p(H_L|S_L)$. On the other hand, we assume $p(I_L|H_L,S_L) = p(I_L|H_L)$. That is, we assume that the observed image data are conditionally independent of the scene model once the pixel labels are determined. Moreover, $p(I_L|H_L)$ links the image observation data with the labeling results. In detail, $p(I_L|H_L)$ is composed of a pixel classification model for pixel-level information and an adjacency model for region-level information. As mentioned above, for the pixel classification model, we assume the observation nodes in Fig. 14 are conditionally independent when the status of the labeling layer is given. In addition, we assume that the connections between the observation layer and the labeling layer are one-to-one and that these connections can be modeled in terms of a “classification energy” $E_D[I_L(m,n),H_L(m,n)]$. This classification energy conveys the property that the labeling result should be consistent with the feature values of the observed image. On the other hand, for the adjacency model, since the local labeling results of adjacent nodes are usually highly correlated, we define an “adjacency energy” $E_A[I_L(m,n),H_L(m,n);N_p]$ to express the assumption that the labels of adjacent pixels should follow some kind of smoothness constraint. By combining these two energy models, we have
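$$p(I_L \mid H_L) = \frac{1}{K} \exp\Big\{ -\sum_{(m,n)} \big( E_D[I_L(m,n), H_L(m,n)] + E_A[I_L(m,n), H_L(m,n); N_p] \big) \Big\}.$$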
Here, Np denotes a neighborhood around the pixel location (m,n) and K is a normalization term.
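To make the MAP formulation concrete, the following Python sketch evaluates the negative log of $p(I_L|H_L)\,p(H_L|S_L)\,p(S_L)$ for a toy problem and finds the minimizer by exhaustive search. Everything specific here is an assumption made for illustration: the three-pixel image, the squared-distance classification energy, the image-independent adjacency penalty, the toy form of $p(H_L|S_L)$, and all numeric constants. The sketch also assumes an exponential (Gibbs) form for $p(I_L|H_L)$ with a constant normalization term K, which can then be dropped from the optimization. Exhaustive search is used only because the example is tiny; it is not the inference procedure of BHF.

```python
import itertools
import math

# Hypothetical toy setup (none of these numbers come from the text):
# a 1 x 3 strip of pixels, binary labels, and a binary scene status.
IMAGE = [0.9, 0.8, 0.1]                 # observed intensities I_L
LABELS = (0, 1)                         # candidate values of each H_L(m, n)
SCENES = (0, 1)                         # candidate scene statuses S_L
MEAN = {0: 0.0, 1: 1.0}                 # assumed intensity of each label
SIGMA = 0.2                             # assumed observation noise level
BETA = 0.3                              # assumed penalty for unequal neighbors
LOG_PRIOR = {0: math.log(0.5), 1: math.log(0.5)}   # ln p(S_L)


def e_d(intensity, label):
    """Classification energy: scaled squared distance to the label's mean."""
    return (intensity - MEAN[label]) ** 2 / (2.0 * SIGMA ** 2)


def e_a(label_a, label_b):
    """Adjacency energy of one neighbor pair (image-independent here)."""
    return 0.0 if label_a == label_b else BETA


def log_p_h_given_s(labels, scene):
    """Toy ln p(H_L | S_L): each label agrees with the scene w.p. 0.7."""
    agree = sum(1 for h in labels if h == scene)
    return agree * math.log(0.7) + (len(labels) - agree) * math.log(0.3)


def neg_log_posterior(labels, scene):
    """-ln[p(I|H) p(H|S) p(S)] up to the constant ln K."""
    energy = sum(e_d(i, h) for i, h in zip(IMAGE, labels))
    energy += sum(e_a(h, k) for h, k in zip(labels, labels[1:]))
    return energy - log_p_h_given_s(labels, scene) - LOG_PRIOR[scene]


def map_estimate():
    """Exhaustively search all (labeling, scene) pairs for the minimum."""
    labelings = itertools.product(LABELS, repeat=len(IMAGE))
    candidates = itertools.product(labelings, SCENES)
    return min(candidates, key=lambda c: neg_log_posterior(*c))


best_labels, best_scene = map_estimate()
print(best_labels, best_scene)
```

In a real system the label grid is far too large for enumeration, so an approximate optimizer would replace `map_estimate`; the structure of `neg_log_posterior` is the part that mirrors the formulation.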
In our system, $p(I_L|H_L)$ and $p(H_L|S_L)$ need to be explicitly determined in order to completely model the system goal as the optimization problem in (10). Once the models of BHF are defined, an optimal inference procedure is performed to obtain the results. In our BHF, the definitions of the 3-D parametric scene model $p(H_L|S_L)$ and the pixel classification model $E_D[I_L(m,n),H_L(m,n)]$ are highly application-dependent. To explain the modeling of $p(H_L|S_L)$ and $E_D[I_L(m,n),H_L(m,n)]$, two examples will be demonstrated in Chapter 4 and Chapter 5, respectively. On the other hand, the adjacency model $E_A[I_L(m,n),H_L(m,n);N_p]$ defined in BHF is more generic. Usually, the local decisions of two adjacent labeling nodes are highly correlated, especially when their corresponding image pixels share similar color features. In our system, by taking the observed image $I_L(m,n)$ into consideration, we define the adjacency energy of labeling nodes as a Markov random field [18] to provide a smoothness constraint between adjacent labeling nodes. Here, we define the adjacency energy through a pairwise cost $C_A[\cdot]$ between each labeling node and its neighbors in $N_p$; this cost involves an adaptive function $G_S$ and a pre-selected penalty constant. In (13), the function $G_S$ is an adaptive function designed to preserve the intensity/color discontinuities in the original image. In our system, we design $G_S$ to be a function similar to a logistic sigmoid function:
$$G_S(U) = \mathrm{Sigm}(U) + 1 = \frac{1 - e^{\rho (U - C_{th})}}{1 + e^{\rho (U - C_{th})}} + 1. \tag{15}$$
An example of $\mathrm{Sigm}(U)$ is shown in Fig. 15. In principle, $\mathrm{Sigm}(U)$ works like a soft thresholding function, with $C_{th}$ and $\rho$ controlling its zero-crossing point and shape, respectively. Both $C_{th}$ and $\rho$ are application-dependent and are determined empirically. $\mathrm{Sigm}(U)$ outputs a positive value if $U$ is smaller than $C_{th}$, and a negative value otherwise. With this design, $C_A[\cdot]$ is equal to zero when $H_L(m,n)$ and $H_L(m+\Delta m,n+\Delta n)$ are the same. If $H_L(m,n)$ and $H_L(m+\Delta m,n+\Delta n)$ are different, $C_A[\cdot]$ gives a larger penalty if the difference between $I_L(m,n)$ and $I_L(m+\Delta m,n+\Delta n)$ is smaller than $C_{th}$, and a smaller penalty otherwise. Hence, to reduce the adjacency energy, $H_L(m,n)$ and $H_L(m+\Delta m,n+\Delta n)$ tend to share the same label when the difference between $I_L(m,n)$ and $I_L(m+\Delta m,n+\Delta n)$ is small, and can take different labels at little cost otherwise.
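The behavior of the adaptive penalty can be checked numerically. The Python sketch below assumes the sigmoid-like form $\mathrm{Sigm}(U) = (1 - e^{\rho(U - C_{th})})/(1 + e^{\rho(U - C_{th})})$ with $G_S(U) = \mathrm{Sigm}(U) + 1$. It further assumes, purely for illustration, that the pairwise cost is the penalty constant multiplied by $G_S$ of the absolute intensity difference whenever the two labels differ; the exact form of $C_A[\cdot]$, the scalar intensities, and the parameter values are hypothetical.

```python
import math


def sigm(u: float, c_th: float, rho: float) -> float:
    """Sigm(U) = (1 - e^{rho (U - C_th)}) / (1 + e^{rho (U - C_th)})."""
    # The ratio equals -tanh(rho (U - C_th) / 2), which avoids overflow in exp.
    return -math.tanh(0.5 * rho * (u - c_th))


def g_s(u: float, c_th: float, rho: float) -> float:
    """G_S(U) = Sigm(U) + 1, which lies strictly between 0 and 2."""
    return sigm(u, c_th, rho) + 1.0


def adjacency_cost(label_a, label_b, i_a, i_b, beta, c_th, rho):
    """Assumed pairwise cost: 0 for equal labels, beta * G_S(|dI|) otherwise."""
    if label_a == label_b:
        return 0.0
    return beta * g_s(abs(i_a - i_b), c_th, rho)


# Hypothetical parameters for 8-bit intensities.
BETA, C_TH, RHO = 1.0, 30.0, 0.2

similar = adjacency_cost(0, 1, 100.0, 105.0, BETA, C_TH, RHO)    # |dI| < C_th
dissimilar = adjacency_cost(0, 1, 100.0, 200.0, BETA, C_TH, RHO) # |dI| > C_th
print(similar, dissimilar)
```

With these settings, a label change across similar pixels is penalized heavily, while a label change across a strong intensity edge is nearly free, which is the discontinuity-preserving behavior the text describes.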