System Modeling - Bayesian Inference and Ghost Suppression

5.4 Bayesian Inference and Ghost Suppression

5.4.1 System Modeling

5.4.1.1 Bayesian Hierarchical Framework

In this system, we adopt the Bayesian hierarchical framework (BHF) to infer the status of all candidate targets simultaneously. In Fig. 38, without loss of generality, we consider an example of a TDP distribution fused from four camera views. The top layer of the BHF architecture is the scene layer SL, which represents the 3-D scene knowledge built at the fusion stage.

Here, we treat the scene model as a knowledge pool that collects messages from all cameras. The bottom layer is the observation layer IL, which contains both the captured images and the corresponding foreground detection results. We define Ii(m,n) and Fi(m,n) as the captured image and the foreground detection result of the ith camera view, respectively. The value of Fi(m,n) is defined as in (50). Between the scene layer and the observation layer, a labeling layer HL is added to handle image labeling, target correspondence, and ghost removal. Here, we define Li(m,n) as the labeling image of the ith camera view.

5.4.1.2 Problem Formulation

In the “five candidate targets” case in Fig. 39, the scene layer SL = {s1, s2, s3, s4, s5} corresponds to the status of five candidate targets, with each status node being either “true” (1) or “ghost” (0). With five candidate targets, we have 2^5 = 32 status combinations in total. For each combination, we generate the expected foreground occlusion pattern by approximating each “true” target as a rectangular pillar on the ground. By projecting the 3-D rectangular pillars onto each camera view, we form the expected foreground image. Ideally, the optimal status combination leads to the best match between the expected foreground image and the detected foreground image. In Fig. 39, we show two status combinations based on the example in Fig. 38. In Fig. 39(a), the scene layer with five candidate targets, together with two of the four camera views, is shown for reference. In Fig. 39(b), we show the combination {s1, s2, s3, s4, s5} = {1,0,1,1,1}, which assumes that the second candidate is a ghost while the remaining candidates are true. By projecting the four 3-D pillars onto the camera views, we compare the expected foreground image with the detected foreground image. In Fig. 39(c), we show another combination {1,1,1,1,1}, which assumes that all candidates are true targets. Comparing the projected foreground images with the detected ones suggests that the latter combination is less likely than the former.
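The exhaustive comparison of status combinations can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the per-candidate pixel footprints stand in for the projected rectangular pillars, a single toy view is used, and the mismatch count is an assumed stand-in for the full matching criterion. The names `expected_foreground` and `best_status` are hypothetical.

```python
from itertools import product


def expected_foreground(status, footprints):
    """Union of the pixel footprints of all candidates marked "true" (1)."""
    pixels = set()
    for s, footprint in zip(status, footprints):
        if s == 1:
            pixels |= footprint
    return pixels


def best_status(footprints, detected):
    """Brute-force search over all 2^M status combinations.

    The mismatch count (pixels present in exactly one of the two masks)
    is an assumed stand-in for the full matching criterion.
    """
    best, best_mismatch = None, None
    for status in product((0, 1), repeat=len(footprints)):
        mismatch = len(expected_foreground(status, footprints) ^ detected)
        if best_mismatch is None or mismatch < best_mismatch:
            best, best_mismatch = status, mismatch
    return best, best_mismatch


# Toy single-view example: pixels are (row, col) pairs.
footprints = [
    {(0, 0), (0, 1)},  # candidate 1
    {(0, 3), (0, 4)},  # candidate 2 (a ghost in this toy scene)
    {(0, 6), (0, 7)},  # candidate 3
]
detected = {(0, 0), (0, 1), (0, 6), (0, 7)}
status, mismatch = best_status(footprints, detected)
print(status, mismatch)  # (1, 0, 1) 0
```

With several camera views, the per-view mismatch counts would be accumulated before the combinations are compared.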

Fig. 39. (a) The scene layer in Figure 36 and two of the four camera views. (b) The combination {s1, s2, s3, s4, s5} = {1,0,1,1,1} and the expected foreground images overlaid with the detected foreground images. (c) The combination {1,1,1,1,1} and the expected foreground images overlaid with the detected foreground images.

Assume there are N camera views and we have identified M candidate targets based on the fused TDP distribution. In our system, target correspondence and image labeling are achieved by assigning a suitable ID from the set {T0, T1, …, TM} to each pixel of the N labeling images. Note that Tk is the ID of the kth target and T0 represents the “background” object. Labeling and ghost suppression are achieved by searching for the optimal status combination that best fits the foreground detection results.

Here, we denote the observation layer as IL = (I,F), where I indicates the set of N original images and F indicates the set of N foreground detection images. Moreover, we denote the labeling layer HL as the set of N labeling images, and the scene layer SL as a status combination. With these definitions, we may combine the target labeling problem and the ghost suppression problem into a single maximum a posteriori (MAP) problem, as introduced in Section 3.3. In this MAP problem, given the observation IL = (I,F), we seek the optimal status combination $S_L^*$ and the optimal labeling $H_L^*$ such that

$$ (S_L^*, H_L^*) = \arg\max_{S_L, H_L} \left\{ \ln[p(I,F|H_L)] + \ln[p(H_L|S_L)] + \ln[p(S_L)] \right\}. \tag{58} $$

In (58), ln[p(I,F|HL)] describes the relation between the labeling images and the observation data, ln[p(HL|SL)] describes the relation between the 3-D scene model and the 2-D labeling images, and ln[p(SL)] describes the prior information about the 3-D scene model.

5.4.1.3 Learning of p(I, F | HL)

As illustrated in Section 3.3, p(IL|HL) is composed of a “classification energy” ED[IL(m,n),HL(m,n)] and an “adjacency energy” EA[IL(m,n),HL(m,n);Np]. Hence, we formulate p(I,F|HL) as

$$ p(I,F|H_L) = K \cdot \prod_i \prod_m \prod_n \exp\left(-E_D[F_i(m,n), H_i(m,n)]\right) \exp\left(-E_A[I_i(m,n), H_i(m,n); N_p]\right). \tag{59} $$

In (59), ED[Fi(m,n),Hi(m,n)] denotes the “classification energy” that relates the ith foreground detection image to the ith labeling image; EA[Ii(m,n),Hi(m,n);Np] denotes the “adjacency energy” that relates the ith original image to the ith labeling image by checking the adjacency property within the neighborhood Np; and K is a normalization term.

Ideally, if the foreground detection results are perfect, we expect Hi(m,n) to be T0 if Fi(m,n) is 0, and to be an element of {T1, T2, …, TM} if Fi(m,n) is 1. Once a labeling violates this expectation, an empirically selected constant α is added to the classification energy to penalize the inference. Hence, we define ED[Fi(m,n),Hi(m,n)] as

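$$ E_D(F_i(m,n), H_i(m,n)) = \alpha \left\{ 1 - \delta[F_i(m,n), u(H_i(m,n))] \right\}, \tag{60} $$

where δ[a,b] equals 1 if a = b and 0 otherwise, and u(Hi(m,n)) equals 0 if Hi(m,n) = T0 and 1 otherwise.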
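The classification energy described above can be sketched as follows. This is a minimal illustration, assuming that the integer 0 encodes the background label T0 and that the integers 1..M encode the target labels T1..TM. The default value of `alpha` is a placeholder, since the text only states that the constant is chosen empirically, and the function names are hypothetical.

```python
BACKGROUND = 0  # assumed encoding of T0; target labels T1..TM are 1..M


def classification_energy(f, h, alpha=1.0):
    """Return 0 if label h agrees with foreground bit f, alpha otherwise.

    alpha=1.0 is a placeholder for the empirically selected constant.
    """
    label_is_foreground = 0 if h == BACKGROUND else 1
    return 0.0 if f == label_is_foreground else alpha


def total_classification_energy(F, H, alpha=1.0):
    """Sum the per-pixel energies over one foreground/labeling image pair."""
    return sum(
        classification_energy(f, h, alpha)
        for f_row, h_row in zip(F, H)
        for f, h in zip(f_row, h_row)
    )


# Toy 2 x 3 view: two pixels disagree with the foreground detection.
F = [[0, 1, 1],
     [0, 0, 1]]
H = [[0, 2, 0],
     [1, 0, 3]]
print(total_classification_energy(F, H, alpha=0.5))  # 1.0
```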

On the other hand, the local decisions of two adjacent labeling nodes are usually highly correlated, especially when their corresponding image pixels share similar color features. Hence, we define the adjacency energy EA[Ii(m,n),Hi(m,n);Np] by using the same smoothness constraint presented in Section 3.3. Here, we briefly review the design of the adjacency energy model.

In our system, the adjacency energy EA[Ii(m,n),Hi(m,n);Np] is defined as.

With this definition, if two neighboring sites are assigned different labels, our system gives a larger penalty when the color difference between the two sites is small and a smaller penalty otherwise. That is, two neighboring sites tend to share the same label when the difference between their color features is small, and tend to have different labels otherwise.
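The qualitative behavior of the adjacency energy can be illustrated with a small sketch. The exact expression is the one from Section 3.3 and is not reproduced here; the code below swaps in a contrast-sensitive Potts penalty, a common form with the same behavior, purely as an assumption. The function name `pair_penalty` and the parameter `beta` are hypothetical.

```python
import math


def pair_penalty(color_p, color_q, label_p, label_q, beta=0.01):
    """Contrast-sensitive Potts penalty for one pair of neighboring sites.

    Assumed form, not the source's equation: no penalty when the labels
    agree; otherwise a penalty that decays with the squared color difference.
    """
    if label_p == label_q:
        return 0.0
    diff2 = sum((a - b) ** 2 for a, b in zip(color_p, color_q))
    return math.exp(-beta * diff2)


similar = pair_penalty((100, 100, 100), (102, 100, 100), 1, 2)
different = pair_penalty((100, 100, 100), (200, 50, 0), 1, 2)
print(similar > different)  # True: similar colors, different labels cost more
```

Summing this pairwise term over all neighbors in Np gives a per-pixel energy that favors label boundaries along color edges.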

5.4.1.4 Learning of p(HL|SL)

Given a status combination SL, we define a conditional probability p(Hi(m,n)=Tk|SL) to express the likelihood of having the label Tk at the pixel (m,n) of the ith labeling image. Here, we construct the probability model in a Monte Carlo manner. Given the status combination SL, we define a few rectangular pillars on the ground. The height and width of each pillar are sampled from the probability density functions p(H) and p(R). The locations of the pillars are sampled from p(X|Tk), where Tk indicates the kth target. With the camera projection parameters, the expected foreground pattern for each target can be generated by projecting these rectangular pillars onto each camera view. Occasionally, two or more targets may project onto the same image region and cause occlusion. The resulting occlusion pattern can be determined by checking the distance from the camera to the mean location of each target. In Fig. 40, we demonstrate the occlusion effect by plotting p(Hi(m,n)=Tk|SL) individually for each of the four targets in Fig. 38(b).
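The Monte Carlo construction can be sketched for a single toy view. Everything specific here is an assumption made for illustration: the camera is replaced by a simple orthographic projection in which a pillar covers a block of columns around its ground position and rises from the bottom image row, Gaussian distributions with fixed spreads stand in for p(H), p(R), and p(X|Tk), and the function name and dictionary keys are hypothetical.

```python
import random


def sample_label_probability(targets, status, width, height,
                             n_samples=500, seed=0):
    """Estimate p(H(m,n) = T_k | S_L) for one toy view by sampling pillars.

    targets: dicts with mean ground position "x", camera distance "depth",
             and mean pillar size "w", "h" (all hypothetical fields).
    status:  tuple of 0/1 flags, one per target.
    Returns prob[k][m][n], where k = 0 is the background label T0.
    """
    rng = random.Random(seed)
    counts = [[[0] * width for _ in range(height)]
              for _ in range(len(targets) + 1)]
    # Process the nearest "true" target first so that it claims contested
    # pixels, which models occlusion by camera distance.
    order = sorted((k for k, s in enumerate(status) if s == 1),
                   key=lambda k: targets[k]["depth"])
    for _ in range(n_samples):
        label = [[0] * width for _ in range(height)]
        for k in order:
            t = targets[k]
            x = rng.gauss(t["x"], 1.0)            # assumed location spread
            w = max(1.0, rng.gauss(t["w"], 0.5))  # assumed width spread
            h = max(1.0, rng.gauss(t["h"], 0.5))  # assumed height spread
            left, right = int(round(x - w / 2)), int(round(x + w / 2))
            top = max(0, height - int(round(h)))
            for m in range(top, height):
                for n in range(max(0, left), min(width, right + 1)):
                    if label[m][n] == 0:  # not claimed by a nearer target
                        label[m][n] = k + 1
        for m in range(height):
            for n in range(width):
                counts[label[m][n]][m][n] += 1
    return [[[c / n_samples for c in row] for row in plane]
            for plane in counts]


targets = [
    {"x": 5, "depth": 3.0, "w": 3, "h": 6},  # nearer target
    {"x": 7, "depth": 8.0, "w": 3, "h": 8},  # farther, partly occluded
]
prob = sample_label_probability(targets, (1, 1), width=12, height=10)
print(prob[1][9][5], prob[2][9][5])
```

At each pixel the estimated probabilities of all labels, including the background, sum to 1, and a candidate marked as a ghost receives zero probability everywhere.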

Based on the definition of p(Hi(m,n)=Tk|SL), we have

$$ p(H_L|S_L) \equiv \prod_i \prod_m \prod_n p(H_i(m,n)|S_L) \tag{63} $$

and we define the log probability function ln[p(HL|SL)] as

$$ \ln p(H_L|S_L) = \sum_i \sum_m \sum_n \ln p(H_i(m,n)|S_L). \tag{64} $$
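The log-probability sum over views and pixels can be written directly as nested loops. The sketch below assumes that per-pixel label probabilities are already available as nested lists indexed by view, label, row, and column; the function name and the floor `eps`, which avoids taking the logarithm of zero, are assumptions introduced for illustration.

```python
import math


def log_p_labeling(labelings, label_probs, eps=1e-12):
    """Sum of ln p(H_i(m,n) | S_L) over all views i and pixels (m, n).

    labelings[i][m][n]      -> label index k assigned at pixel (m, n), view i
    label_probs[i][k][m][n] -> p(H_i(m,n) = T_k | S_L)
    eps is an assumed floor that avoids log(0).
    """
    total = 0.0
    for H, P in zip(labelings, label_probs):
        for m, row in enumerate(H):
            for n, k in enumerate(row):
                total += math.log(max(P[k][m][n], eps))
    return total


# One view with a 1 x 2 image and two labels (background and one target).
label_probs = [[
    [[0.5, 0.25]],  # probability of T0 at each pixel
    [[0.5, 0.75]],  # probability of T1 at each pixel
]]
labelings = [[[0, 1]]]
print(log_p_labeling(labelings, label_probs))  # ln(0.5) + ln(0.75)
```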

Fig. 40. Examples of p(Hi(m,n) = Tk|SL).

the optimal status combination. In our system, if Mt true targets are identified at the previous time instant, we assume it is more likely to have a similar number of true targets at the current moment. That is, if we denote $S_o^{t-1}$ as the optimal status combination at the previous time instant (t-1) and $S^t$ as a status combination at the current time instant t, we define the prior probability of $S^t$ as

$$ p(S^t) = \begin{cases} W_1, & \text{if } |N(S^t) - N(S_o^{t-1})| \le 1 \\ W_2, & \text{otherwise} \end{cases}, \tag{65} $$

where $W_1$ and $W_2$ are two constants with $W_1 \ge W_2$. In (65), N(SL) denotes the number of true targets in the status combination SL. In detail, if we know the ratio between $W_1$ and $W_2$, we can determine $W_2$ such that the probabilities sum to 1. For example, assume that $W_1 = 2W_2$, the number of candidate targets at time t is 5, and the number of true targets in the previous optimal combination $S_o^{t-1}$ is 4. For this case, we have $2W_2 \cdot (C^5_3 + C^5_4 + C^5_5) + W_2 \cdot (C^5_0 + C^5_1 + C^5_2) = 1$. Hence, we choose $W_2 = 1/48$ and $W_1 = 1/24$.
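The normalization in this example can be checked with a short script. The sketch below solves for the two constants given an integer ratio between them and reproduces the worked case with five candidates and four previously true targets; the function names `prior_weights` and `prior` are hypothetical, and exact fractions are used so the result is not affected by rounding.

```python
from fractions import Fraction
from math import comb


def prior_weights(n_candidates, n_prev_true, ratio=2):
    """Solve for (W1, W2) given W1 = ratio * W2 and a prior that sums to 1.

    ratio is assumed to be an int or a Fraction.
    """
    # Number of status combinations whose true-target count is within 1
    # of the previous count.
    near = sum(comb(n_candidates, n) for n in range(n_candidates + 1)
               if abs(n - n_prev_true) <= 1)
    far = 2 ** n_candidates - near
    w2 = Fraction(1, ratio * near + far)
    return ratio * w2, w2


def prior(status, n_prev_true, w1, w2):
    """Prior probability of one status combination (a tuple of 0/1 flags)."""
    return w1 if abs(sum(status) - n_prev_true) <= 1 else w2


w1, w2 = prior_weights(n_candidates=5, n_prev_true=4)
print(w1, w2)  # 1/24 1/48
```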

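From the document 貝氏階層式結構於視訊監控之研究與應用 (Research and Application of the Bayesian Hierarchical Framework in Video Surveillance), pages 123-128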