5.4 Bayesian Inference and Ghost Suppression

$$ \cdots = \begin{cases} \cdots \\ \cdots \end{cases}, \qquad (65) $$

where $W_1$ and $W_2$ are two constants with $W_1 \geq W_2$. In (65), $N(S_L)$ denotes the number of true targets in the status combination $S_L$. In detail, if we know the ratio between $W_1$ and $W_2$, we can determine $W_2$ such that the probabilities sum to 1. For example, assume $W_1 = 2W_2$, the number of candidate targets at time $t$ is 5, and the number of true targets in the previous optimal combination $S_o^{t-1}$ is 4. For this case, we have

$$ 2W_2 \cdot (C^5_3 + C^5_4 + C^5_5) + W_2 \cdot (C^5_0 + C^5_1 + C^5_2) = 1, $$

that is, $2W_2 \cdot 16 + W_2 \cdot 16 = 48\,W_2 = 1$. Hence, we choose $W_2 = 1/48$ and $W_1 = 1/24$.
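The normalization in this example can be checked numerically. The following Python sketch is an illustration only: the function name `prior_weights` and the `favored_counts` argument are hypothetical, and the set of true-target counts that receive the larger weight (here 3, 4, and 5) is read off the binomial sum in the example rather than taken from a general rule.

```python
from fractions import Fraction
from math import comb


def prior_weights(num_candidates, favored_counts, ratio):
    """Return (W1, W2) so that the prior sums to 1 over all 2**n status combinations.

    favored_counts: values of N(S_L) that receive the weight W1 (assumed input).
    ratio: the known ratio W1 / W2.
    """
    ratio = Fraction(ratio)
    # Number of status combinations whose true-target count is favored.
    favored = sum(comb(num_candidates, k) for k in favored_counts)
    # All remaining combinations receive the weight W2.
    others = 2 ** num_candidates - favored
    # Solve ratio * W2 * favored + W2 * others = 1 for W2.
    w2 = Fraction(1, ratio * favored + others)
    return ratio * w2, w2


# Example from the text: 5 candidates, W1 = 2 * W2, W1 applied to counts 3, 4, 5.
w1, w2 = prior_weights(5, favored_counts=(3, 4, 5), ratio=2)
print(w1, w2)  # 1/24 1/48
```

Exact fractions are used so that the result can be compared directly with the values 1/24 and 1/48 quoted in the example.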

5.4.2 Multi-Target Labeling and Tracking

5.4.2.1 Optimal Inference of Target Labeling

With the above deduction, the labeling of targets and the suppression of ghost targets can be solved by finding the optimal labeling images ($H_L^*$) and the optimal status combination ($S_L^*$) that maximize the following potential function $C\,p(H_L, S_L)$:

$$ (H_L^*, S_L^*) = \cdots \qquad (66) $$

Basically, the problem of target labeling and ghost suppression is treated as a maximum a posteriori (MAP) problem from the viewpoint of a Bayesian generative model. Here, we incorporate four constraint terms: the classification energy ED, the adjacency energy EA, the likelihood function p(HL|SL), and the prior probability p(SL). As illustrated in Fig. 38, the classification energy ED[Fi(m,n),Hi(m,n)] represents the bottom-up constraint between the foreground detection images and the labeling images. To model the interaction between the labeling layer and the scene layer, the likelihood function p(HL|SL) represents the expected labeling layout based on the status combination SL. The expected inter-occlusion patterns among candidate targets are also modeled in p(HL|SL) to influence the classification of local labeling nodes. By introducing the adjacency energy EA[Ii(m,n),Hi(m,n);Np], the proposed framework can not only infer the labeling based on the fusion of scene knowledge and foreground detection results, but also refine the labeling results based on the original image data.

Last, the prior probability p(SL) includes the temporal prediction based on the previous decision.

Moreover, due to the inter-occlusion among targets, the status inference of a candidate target may depend on some other candidate targets. Hence, we need to take into account relevant candidate targets when we infer the status of a candidate target.

A brute-force way is to evaluate all possible status combinations and pick the optimal one as $S_L^*$. However, this leads to exponentially growing computational complexity as the number of candidate targets increases. Fortunately, there is generally some separateness among candidate targets that can be used to reduce the number of status hypotheses. In our system, if the projection of a candidate target on a camera view does not overlap with the projections of other targets, that candidate target is regarded as a true target. By excluding those targets with isolated projections, we only need to check the status combinations of the remaining targets.

For example, in Fig. 38, the target $S_5$ corresponds to the left target in the third camera view. Since this target has an isolated projection in the third camera view, it is treated as a true target. For this case, we only generate $2^4 = 16$ status combinations for $S_1$, $S_2$, $S_3$, and $S_4$, instead of generating $2^5 = 32$ combinations for all five targets.
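The pruning of status hypotheses can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: projections are approximated by axis-aligned boxes `(x0, y0, x1, y1)`, the box coordinates are made up, and the helper names `overlaps`, `isolated_targets`, and `status_combinations` are hypothetical.

```python
from itertools import product


def overlaps(a, b):
    """Axis-aligned boxes (x0, y0, x1, y1) overlap if they intersect on both axes."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def isolated_targets(views):
    """Return targets whose projection overlaps no other projection in some view.

    views: list of dicts mapping target name -> projected box, one dict per camera.
    """
    isolated = set()
    for boxes in views:
        for name, box in boxes.items():
            others = (b for other, b in boxes.items() if other != name)
            if not any(overlaps(box, b) for b in others):
                isolated.add(name)
    return isolated


def status_combinations(targets, isolated):
    """Yield dicts target -> bool (True = true target, False = ghost).

    Isolated targets are fixed to True; only the remaining ones are enumerated.
    """
    free = [t for t in targets if t not in isolated]
    for bits in product((False, True), repeat=len(free)):
        combo = {t: True for t in targets if t in isolated}
        combo.update(zip(free, bits))
        yield combo


# Made-up boxes for one camera view: S1..S4 form an overlapping chain, S5 stands alone.
view = {
    "S1": (0, 0, 10, 20),
    "S2": (5, 0, 15, 20),
    "S3": (12, 0, 22, 20),
    "S4": (20, 0, 30, 20),
    "S5": (40, 0, 50, 20),
}
targets = list(view)
isolated = isolated_targets([view])
combos = list(status_combinations(targets, isolated))
print(sorted(isolated), len(combos))  # ['S5'] 16
```

With five candidates and one isolated target, the enumeration produces 16 hypotheses instead of 32, matching the example.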

In principle, the best configuration of labels depends on the image data, the foreground detection results, and the scene model. In our experiments, even though many false alarms and false rejections may appear in the foreground detection results, these errors have little influence on the final inference result. Based on the proposed BHF, the inter-occlusion problem can be effectively analyzed, the connected foreground regions can be well separated, and the ghost targets can be correctly identified.

5.4.2.2 3-D Target Model Refinement

Usually, the moving targets in the surveillance zone may have different model parameters, such as the target height and width. If personalized target models can be obtained, the performance of the proposed inference framework can be further boosted. In real situations, however, it is impractical to obtain the personalized 3-D model parameters in advance. Hence, in our system, we achieve personalized 3-D modeling by treating the model parameters as latent random variables and introducing an EM-based algorithm to iteratively refine the model parameters. The basic idea is to update the 3-D model parameters in the Expectation step based on the labeling results derived from the optimization procedure in (66). Next, in the Maximization step, by using (54) to consider the refined statistics of the 3-D model parameters in an expectation sense, the optimization procedure in (66) is re-executed to boost the inference performance. The operation is repeated until the updated parameters converge or the maximum iteration number is reached.

In Fig. 41, we show an example of the labeling results with and without the target model refinement. Since the targets differ noticeably in height, labeling with a unified target model produces wrong labels around the head regions, as shown in Fig. 41(b). After the target model is refined, more accurate labeling results are achieved, as shown in Fig. 41(c).

In our system, the major 3-D target model of each target is a pillar model standing at a location X on the ground plane, with parameters height (H) and width (R). Initially, the proposed EM algorithm uses the pre-trained probability distributions p(H) and p(R) to model the uncertainty of each target height and width. With this initial setting, the proposed BHF generates the optimal inference of target labeling.

Since the BHF combines not only the 3-D scene priors and target priors but also the observed image data and the corresponding foreground detection results, the optimal target labeling actually reveals the personal property of each detected target. Hence, based on the labeling results in multiple image views, we further update the probability distributions of H and R to establish personalized probability models. In practice, we found the target width has less uncertainty among targets and the pre-trained probability p(R) can well model the uncertainty in target width. Hence, in our system, only the model of target height is recursively refined.

Fig. 41. Illustration of the labeling results. (a) Two camera views. (b) Without and (c) with target model refinement.

In the Expectation step of the proposed EM procedure, the main goal is to refine the posterior probability of each target height given the multi-view labeling results. In our system, based on Bayes' rule, the refinement of the posterior probability is defined as follows:

$$ p(H_k^r \mid L^r) \equiv C \cdot p(L^r \mid H_k^r) \cdot p(H_k^r), \qquad (67) $$

where

$$ p(H_k^r) = \begin{cases} p(H), & \text{if } r = 1 \\ p(H_k^{r-1} \mid L^{r-1}), & \text{otherwise.} \end{cases} $$

In (67), $L^r$ indicates the labeling results of multiple image views at the $r$th iteration of our EM procedure. $H_k^r$ is the height of the $k$th target at iteration $r$, $C$ is a normalization constant, $p(L^r \mid H_k^r)$ is the likelihood term, which will be defined later, and $p(H_k^r)$ is the prior term of $H_k^r$. In our system, we directly treat $p(H_k^{r-1} \mid L^{r-1})$ as the prior information propagated from the previous iteration to the current iteration to set the prior $p(H_k^r)$. Initially, $p(H_k^1)$ is set to be the pre-trained target height probability $p(H)$.
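The recursion in (67) can be illustrated with a discretized height distribution. The Python sketch below is a toy example under stated assumptions: the candidate heights, the pre-trained prior `p_H`, and the likelihood values are invented, the function name `refine_height_posterior` is hypothetical, and the likelihood is held fixed across iterations, whereas in the described system it would be recomputed from the new labeling results at each iteration.

```python
def refine_height_posterior(prior, likelihood):
    """One update of the form posterior(h) = C * likelihood(h) * prior(h).

    prior, likelihood: dicts mapping a candidate height to a non-negative number.
    The constant C is obtained by normalizing the products to sum to 1.
    """
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical pre-trained height prior p(H) over three candidate heights (cm).
p_H = {160: 0.25, 170: 0.50, 180: 0.25}
# Hypothetical likelihood of the labeling results for each candidate height.
lik = {160: 0.2, 170: 0.3, 180: 0.6}

prior = p_H  # iteration r = 1 uses the pre-trained prior
for r in range(1, 4):
    posterior = refine_height_posterior(prior, lik)
    prior = posterior  # the posterior becomes the prior of iteration r + 1

best_height = max(posterior, key=posterior.get)
print(best_height)  # 180
```

Each pass sharpens the distribution around the height that best explains the labeling, which is the behavior the prior propagation is meant to produce.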

To formulate the likelihood term $p(L^r \mid H_k^r)$, we project the pillar model at the ground position of the $k$th target, with height $H_k^r$ and width $R_k$, onto multiple camera views and verify the projected regions against the labeling results. Since the variables H and R are assumed to be statistically independent, we assign the width of all targets to be the mean value of p(R) during the computation of $p(L^r \mid H_k^r)$. Ideally, if a more precise target height is chosen, the projected region will better fit the labeling result.

Hence, we define the likelihood term as

$$ p(L^r \mid H_k^r) = \cdots, \qquad (68) $$

where $p^i_{m,n}(l)$ is the probability of the labeling pixel at $(m,n)$ with the label ID "$l$", and $N$ is the total number of pixels within the projected regions. Since different $H_k^r$ may generate different projected regions, we use the function $(\cdot)^{1/N}$ for normalization. Moreover, we assume the statuses of different labeling pixels are independent of each other and we evaluate only those pixels inside the projected regions of the $k$th target. In principle, the label ID "$l$" tends to be $T_k$. Hence, $p^i_{m,n}(l)$ has a higher probability if "$l$" equals $T_k$ and a lower probability if "$l$" equals $T_0$. Occasionally, owing to occlusion, "$l$" may equal some foreground target other than $T_k$. In this case, we do not have the information about $T_k$ and we assign $p^i_{m,n}(l)$ to be an intermediate value:

$$ p^i_{m,n}(l) = \cdots, \qquad (69) $$

where $\lambda$ is a normalization term that makes the probabilities sum to 1. Moreover, $x$, $y$, and $z$ are empirically pre-selected parameters, with $x > z > y$. If we rewrite (68) based on (66), we get a likelihood form as below:

$$ p(L^r \mid H_k^r) = \cdots, \qquad (70) $$

… $T_0$ pixels, and the number of other pixels inside the projected regions in all camera views.

Basically, (70) simply measures the matching level by accumulating the weighted sum of different labeling pixels inside the projected regions with the weighting parameters $(x, y, z)$. Once the likelihood term $p(L^r \mid H_k^r)$ is determined, the refined probability distribution of the $k$th target height at the current iteration can be obtained based on (67). The refined model $p(H_k^r \mid L^r)$ is fed back to the proposed BHF to find the optimal object labeling again. In our experiments, 2 to 3 iterations are enough for the convergence of the EM algorithm.
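To make the matching idea concrete, the following Python sketch scores a candidate height by the geometric mean (the 1/N power of the product) of per-pixel label probabilities inside the projected region. It is a rough illustration and not the exact likelihood used in the system. It assumes that x is assigned to pixels labeled with the target itself, y to background pixels, and z to pixels labeled with another foreground target, which is one reading of the ordering x > z > y. The numeric values, the label encoding, and the function name `height_likelihood` are hypothetical, and the normalization term λ is taken to be absorbed into the chosen values.

```python
import math


def height_likelihood(labels, target_id, x=0.6, y=0.1, z=0.3, background_id=0):
    """Geometric mean of per-pixel label probabilities inside the projected region.

    labels: label IDs of all pixels covered by the projected pillar model,
            pooled over the camera views (assumed input format).
    x, y, z: assumed probabilities for target, background, and other-target pixels.
    """
    labels = list(labels)
    n = len(labels)
    if n == 0:
        return 0.0
    n_target = sum(1 for l in labels if l == target_id)
    n_background = sum(1 for l in labels if l == background_id)
    n_other = n - n_target - n_background
    # Weighted sum of log-probabilities, divided by N: the log of the geometric mean.
    log_lik = (n_target * math.log(x)
               + n_background * math.log(y)
               + n_other * math.log(z)) / n
    return math.exp(log_lik)


# A well-fitting height covers mostly target pixels; an overly tall pillar
# additionally covers background pixels above the head.
good_fit = [1] * 90 + [0] * 5 + [2] * 5
too_tall = [1] * 90 + [0] * 30 + [2] * 5
print(height_likelihood(good_fit, 1) > height_likelihood(too_tall, 1))  # True
```

Working in the log domain turns the product over pixels into a weighted count of the three pixel types, which is the accumulation that the text describes.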

5.4.2.3 Multi-target Tracking

In our system, by associating detections across successive frames, we also extend the detection results to perform 3-D tracking over the ground plane. Basically, object tracking is treated as a dynamic system problem. Based on the proposed Bayesian detection framework, the major observation of the dynamic system comes from the estimated target location on the ground plane. In principle, several Bayesian filtering techniques can be used to deal with the dynamic system problem. For instance, we can use a Monte Carlo based framework to track multiple targets on the ground plane, as proposed in [92]. However, for the sake of computational simplicity, we adopt the Kalman filter to track each target in the scene.
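A ground-plane Kalman tracker of this kind might look like the Python sketch below. The motion model is an assumption: the text does not specify it, so a constant-velocity model with independent x and y axes is used here, and the noise settings, the class names `AxisKalman` and `GroundTracker`, and the synthetic noise-free measurements are all hypothetical.

```python
class AxisKalman:
    """Kalman filter for one ground-plane axis with state [position, velocity]."""

    def __init__(self, pos, var_pos=1.0, var_vel=1.0, q=0.01, r=0.25):
        self.p, self.v = pos, 0.0
        self.P = [[var_pos, 0.0], [0.0, var_vel]]  # state covariance
        self.q, self.r = q, r  # process noise (on velocity), measurement noise

    def predict(self, dt=1.0):
        (a, b), (c, d) = self.P
        self.p += dt * self.v  # constant-velocity motion
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = diag(0, q)
        self.P = [[a + dt * (b + c) + dt * dt * d, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r           # innovation variance (only position is observed)
        k0, k1 = a / s, c / s    # Kalman gain
        y = z - self.p           # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P with H = [1, 0]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


class GroundTracker:
    """Tracks one target on the ground plane with two independent axis filters."""

    def __init__(self, x, y):
        self.fx, self.fy = AxisKalman(x), AxisKalman(y)

    def step(self, zx, zy, dt=1.0):
        for f, z in ((self.fx, zx), (self.fy, zy)):
            f.predict(dt)
            f.update(z)
        return self.fx.p, self.fy.p


# Synthetic target moving at (1.0, 0.5) ground units per frame.
tracker = GroundTracker(0.0, 0.0)
for t in range(1, 21):
    est = tracker.step(1.0 * t, 0.5 * t)
print(est, (tracker.fx.v, tracker.fy.v))
```

One such tracker would be kept per target, with the estimated ground location from the detection stage supplied as the measurement at each frame.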

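From the document: 貝氏階層式結構於視訊監控之研究與應用 (Research and Application of Bayesian Hierarchical Frameworks in Video Surveillance), pp. 128-135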