Background Model Initialization via Classification
2.4 Experimental Results
2.4.2 Performance Evaluation
Because classification with CGBoost is more than 20 times faster than with an SVM implementation (see Table 2.1), we report below only the experimental results obtained with the CGBoost classifier $f_B^{*}$. To test the generality of the proposed scheme, all scenes to be estimated in the testing sequences are completely different from those of the training data. The testing sequences also contain complex motions, e.g., substantial object interactions, and varied lighting conditions, such as cloudiness.
Figure 2.6: Results of background model initialization. (a)–(e) The upper row shows image frames A000, A044, A261, A510, and A650 from sequence A, and the lower row depicts the progressive estimation results $\tilde{B}^{*}_{0}$, $\tilde{B}^{*}_{44}$, $\tilde{B}^{*}_{261}$, $\tilde{B}^{*}_{510}$, and $\tilde{B}^{*}_{650}$. The initial background model is completed at t∗ = 650. (f)–(j) The frame subtraction results for the same five frames, obtained by referencing the derived background model $\tilde{B}^{*}_{650}$.
Background model initialization
We first demonstrate the efficiency of our method for an outdoor environment.
Sequence A contains different types of objects, including slightly waving trees, walking people, slow- and fast-moving vehicles, and even a stationary bike rider. We shall use this example as a benchmark to analyze the quality of our results and the detection rates, and to make comparisons with other existing algorithms. As illustrated in Figs. 2.6 (a) and (b), the background model is initialized to an empty set at t = 0, and it is not until the 44th frame that stationary regions of the scene begin to be incorporated into the model (because N = 45 in our setting). Fig. 2.6 (c) shows that a very slow-moving car is temporarily and falsely adapted into the background (and is eventually removed after it leaves the scene). More interesting is the scenario depicted in Figs. 2.6 (d) and (e), in which a bike rider waiting for a green traffic light has remained still long enough to become a part of the derived background model at t∗ = 650. The system can then start to track objects via frame differencing and proper model updating. On the other hand, subtracting the model from the first t∗ frames reveals the complexity of initializing the background model. Factors such as dark shadows and waving trees can now be easily identified in Figs. 2.6 (f)–(j).

Figure 2.7: Results of background block detection for frames (a) A020, (b) A166, (c) A261, (d) A372, and (e) A540. Row one: Image frames from sequence A. Row two: The manually labeled foreground (white) and background (black) maps. Note that the very slow-moving car in (c), which later becomes fully stationary in (d), is labeled as foreground in (c) and as background in (d). Row three: Our background block detection results. The foreground blocks in gray are identified by the top-down validation process.
Background block detection
To quantitatively evaluate the accuracy of the bottom-up block classifications and the improvement brought by the top-down validations, we select twenty image frames from sequence A that contain moving objects of different sizes and speeds, specular light, and shadows. We then manually label each image block of the twenty frames, which yields a set of 20061 background and 3939 foreground blocks that we use to examine the accuracy of our scheme for background block detection.

Table 2.3: Detection error rates with and without top-down validation.

BG/FG Block Detection    Without top-down   With top-down   Improvement
Detection Error Rate∗    0.04142            0.03779          8.764 %
False Positive Rate      0.02246            0.01825         18.744 %
False Negative Rate      0.01896            0.01954         -3.059 %

∗ Detection Error Rate = # of Misclassified Blocks / # of Testing Blocks
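The Improvement column of Table 2.3 can be checked with a few lines of arithmetic. The sketch below assumes that the improvement is the relative reduction of each rate, i.e., (without − with) / without; the function and variable names are illustrative and not part of the original system.

```python
def relative_improvement(without_td, with_td):
    """Percentage reduction of an error rate; a negative value means the rate got worse."""
    return (without_td - with_td) / without_td * 100.0

# (without top-down, with top-down) rates copied from Table 2.3
rates = {
    "detection_error": (0.04142, 0.03779),
    "false_positive": (0.02246, 0.01825),
    "false_negative": (0.01896, 0.01954),
}

improvements = {name: relative_improvement(a, b) for name, (a, b) in rates.items()}
for name, value in improvements.items():
    print(f"{name}: {value:.3f} %")
```

Under this assumption the three computed values agree with the tabulated percentages to three decimals, including the negative entry for the false negative rate.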
In Fig. 2.7, we show results for five selected frames. Note that the gray blocks are detected as foreground through the top-down validation process. To further justify the need for a combined local and global approach, a comparison of the detection error rates with and without the top-down validation step is given in Table 2.3.
Although the detection rates may vary when our system is tested in different environments, it is clear that applying the top-down validation reduces the errors significantly. In this example, false positives are reduced by about 18.744%, while false negatives increase by only 3.059%. Two observations arise from the foregoing verification of the accuracy of our scheme in detecting background blocks.
• For the classifier to accommodate small variations such as waving trees, it may mistakenly classify very slow-moving objects as background (see Fig. 2.6 (c) and Fig. 2.7 (c)). This is indeed a trade-off, and we resolve the issue by learning a proper decision boundary from the training data.
• Because we use motion features to construct a general classifier, our classification scheme may suffer from the aperture problem when detecting large objects (see Figs. 2.7 (a) and (c)). With the top-down validation, this problem can be alleviated to some degree. Still, a number of false positives caused by the aperture problem exist frame-wise. However, since a false positive is adapted into a background model only if the same one occurs for N consecutive frames, such an event rarely happens in practice (with a very low probability, e.g., around $0.01825^{N}$ for the example in Table 2.3).
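To give a feel for how small the probability in the second observation is, the following sketch evaluates it for N = 45, the setting used for sequence A. It assumes that false positives in successive frames are independent events with the per-frame rate of Table 2.3, which is a simplification (errors caused by the aperture problem are likely correlated over time); the function name is hypothetical.

```python
import math

def persistence_probability(per_frame_rate, n_frames):
    """Probability that an event with the given per-frame rate occurs in
    n_frames consecutive frames, ASSUMING independence between frames."""
    return per_frame_rate ** n_frames

fp_rate = 0.01825   # false positive rate with top-down validation (Table 2.3)
n = 45              # value of N used for sequence A

p = persistence_probability(fp_rate, n)
log10_p = n * math.log10(fp_rate)   # same quantity on a log scale
print(f"p = {p:.3e}, log10(p) = {log10_p:.2f}")
```

Even if the independence assumption is optimistic by many orders of magnitude, the resulting probability stays negligible for the values of N considered here.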
Feature selection
In our design, two general motion cues, the inter-frame difference and the optical flow value, are adopted to discriminate background scenes. While the inter-frame difference is effective in detecting static background blocks, the optical flow value provides the discriminability needed to classify image blocks with small motions as either gradually varying background or moving foreground. To further justify the use of the optical flow cue, we provide additional evaluations that use the inter-frame difference alone. With the best setting of the difference threshold at 0.013, the training error rises from 0.0466 to 0.0528 (a 13.3% increase), and the testing error for the 20 evaluation image frames increases from 0.0378 to 0.0436 (a 15.3% increase). Hence, the benefit of incorporating the optical flow value is evident.
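The difference-only baseline used in this evaluation amounts to thresholding a single feature per block. The sketch below is one plausible reading of it: the feature is assumed to be the mean absolute difference between corresponding pixels of a block in consecutive frames, with intensities normalized to [0, 1]. Only the threshold value 0.013 comes from the text; the feature definition and the function names are assumptions made for illustration.

```python
def mean_abs_difference(prev_block, cur_block):
    """Mean absolute inter-frame difference over one image block.
    Blocks are lists of rows of intensities in [0, 1] (assumed normalization)."""
    total = 0.0
    count = 0
    for prev_row, cur_row in zip(prev_block, cur_block):
        for p, c in zip(prev_row, cur_row):
            total += abs(c - p)
            count += 1
    return total / count

def is_background_block(prev_block, cur_block, threshold=0.013):
    """Label a block as background when its inter-frame difference is small."""
    return mean_abs_difference(prev_block, cur_block) <= threshold

static_block = [[0.2, 0.2], [0.2, 0.2]]
changed_block = [[0.2, 0.6], [0.2, 0.2]]
print(is_background_block(static_block, static_block))   # unchanged block
print(is_background_block(static_block, changed_block))  # one pixel changed by 0.4
```

A rule of this kind cannot separate a slightly waving tree from a slowly moving object when both produce similar difference values, which is the gap the optical flow cue is meant to close.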
Parameter settings
We next investigate the sensitivity of our method with respect to different values
Figure 2.8: Results of different parameter settings: (a) N = 30, (b) N = 45, (c) N = 60, (d) N = 75, and (e) N = 90. In each case, we show the image frame $I_{t^{*}}$ (top) at which our system completes its estimation of a MAP initial background model (bottom). Only N = 30 produces an unstable estimation, due to the violation of the stationary criterion. Different values of N and L (given in Definition 1) mainly affect the time needed to derive a stable background model. It takes 403, 650, 678, 1001, and 1031 frames, respectively, to compute the initial background models.
scheme.) Specifically, we have experimented with N = 30, 45, 60, 75, and 90. As shown in Fig. 2.8, different values of N and L mainly affect the time needed to compute a stable initial background model. The larger the value of N, the longer it takes to complete the estimation. Except for N = 30, which is too short a period to yield a stationary adaptation, all other settings of N lead to stable background models.
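The role of N can be illustrated with a simple run-length counter. The sketch below assumes that a block becomes eligible for adaptation once it has been labeled as background in N consecutive frames; it deliberately ignores the parameter L and the rest of the stationary criterion in Definition 1, which is not reproduced in this section. The function name is hypothetical.

```python
def first_adaptation_frame(is_background, n):
    """Return the 0-based index of the first frame at which a block has been
    labeled background for n consecutive frames, or None if that never happens."""
    run = 0
    for t, flag in enumerate(is_background):
        run = run + 1 if flag else 0   # any foreground label resets the run
        if run >= n:
            return t
    return None

always_static = [True] * 100
interrupted = [True] * 30 + [False] + [True] * 60

print(first_adaptation_frame(always_static, 45))
print(first_adaptation_frame(interrupted, 45))
```

For a block that is static from t = 0, the counter reaches N = 45 at frame index 44, which is consistent with the first stationary regions entering the model in Fig. 2.6 (b). A single foreground label restarts the count, so a larger N delays completion, in line with the trend reported in Fig. 2.8.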
Comparisons of background model completeness
A clear advantage of our formulation is the ability to know when a well-defined initial background model is ready to be used for tracking. We demonstrate this point by making comparisons with the popular mixture of Gaussians model [54] and the local image flow approach [22]. While the two methods are also effective for background initialization, they both lack a clear definition of what the underlying background scene is at any time instant of the estimation process. Systems based on the mixture of Gaussians work by memorizing a certain number of modes for each pixel, and then integrating the most probable modes pixel by pixel to form a background model. This is in essence a local scheme, for which the overall quality of a background model is difficult to evaluate. On the other hand, the method described in [22] is designed to process a whole image sequence to output a background model. We thus need to modify the algorithm into a sequential one so that the comparisons can be made by examining the respectively derived background models frame by frame.
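The per-pixel bookkeeping of modes described above can be sketched as follows. This is a simplified version of the standard online mixture-of-Gaussians update for a single grayscale pixel, not the exact implementation of [54] or the configuration used in our comparisons. The defaults of three modes, a blending rate of 0.01, and initialization to the first frame mirror the experimental setting reported below; the initial variance, the 2.5-sigma matching rule, the variance floor, and the weight/sigma ranking are assumptions taken from the common formulation.

```python
import math

class PixelMixture:
    """Simplified online mixture of Gaussians for one pixel (intensity in [0, 1])."""

    def __init__(self, first_value, k=3, alpha=0.01, init_var=0.01, match_sigmas=2.5):
        self.alpha = alpha                # blending (learning) rate
        self.init_var = init_var          # assumed variance of a new mode
        self.match_sigmas = match_sigmas  # assumed matching radius in sigmas
        # One dominant mode at the first observation; the others are unused.
        self.weights = [1.0] + [0.0] * (k - 1)
        self.means = [first_value] * k
        self.variances = [init_var] * k

    def update(self, x):
        a = self.alpha
        match = None
        for i, (w, m, v) in enumerate(zip(self.weights, self.means, self.variances)):
            if w > 0.0 and abs(x - m) <= self.match_sigmas * math.sqrt(v):
                match = i
                break
        if match is None:
            # No mode explains x: replace the least probable mode.
            worst = min(range(len(self.weights)), key=lambda j: self.weights[j])
            self.weights[worst] = a
            self.means[worst] = x
            self.variances[worst] = self.init_var
        else:
            # Decay all weights, then reinforce and adapt the matched mode.
            self.weights = [(1.0 - a) * w for w in self.weights]
            self.weights[match] += a
            d = x - self.means[match]
            self.means[match] += a * d
            self.variances[match] = max((1.0 - a) * self.variances[match] + a * d * d, 1e-6)
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

    def background(self):
        """Mean of the most probable mode, ranked by weight / sigma."""
        best = max(range(len(self.weights)),
                   key=lambda j: self.weights[j] / math.sqrt(self.variances[j]))
        return self.means[best]

# A pixel that shows background intensity 0.5, briefly occluded by a brighter object.
pixel = PixelMixture(first_value=0.5)
for t in range(1, 651):
    pixel.update(0.9 if 100 <= t < 120 else 0.5)
print(pixel.background())
```

Each pixel maintains and reports its own estimate independently of its neighbors, so nothing in this procedure indicates when the assembled image as a whole can be regarded as a complete background model.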
The first experiment is carried out with image sequence A, where the three algorithms are each run until image frame t∗ = 650, at which our method completes its estimation of an initial background model. For the mixture model, we use three Gaussian distributions and a blending rate of 0.01, and initialize the background model at t = 0 to the first image frame. For the local image flow implementation, the values of w and δmax are set to 30 and 15, respectively, and the background model is an empty set at t = 0. In Fig. 2.9, we show some intermediate results of our method and the corresponding background models produced by the other two methods. Due to the batch nature of the local image flow scheme, its three background models shown in Fig. 2.9 are obtained by running the algorithm three times, using the respective periods of image frames as the inputs. Overall, the results produced by our method and the mixture of Gaussians are more reliable than those of the local image flow, largely because the local flow scheme relies heavily on the estimated optical flow directions and their accuracy. While the outcomes of the mixture of Gaussians seem to be satisfactory and similar to ours, the absence of
Figure 2.9: Comparisons of [22], [54], and our approach at frames (a) A000, (b) A044, (c) A261, and (d) A650. Row one: Image frames from image sequence A. Row two: The intermediate results of background estimation by our method, which completes at t∗ = 650. Rows three and four: The results respectively derived by the local image flow approach [22] and the mixture of Gaussians method [54] at each corresponding time instant.
remains a disadvantage of the approach. Furthermore, one would expect a mixture of Gaussians method to be sensitive to lighting variations, in that it works by locally combining pixel intensities. We shall further elaborate on this issue with the next experiment.
Our second comparison focuses on the effects of lighting changes. For the outdoor sequence B (see Fig. 2.10), the lighting condition varies rapidly due to overcast clouds, and the experimental results show that our method is less sensitive to variations of this kind. Specifically, in Figs. 2.10 (e) and (f), we enlarge the two derived background models and enhance their contrast for a clearer view. Note that, especially in the road area, our background model estimation is clearly of better quality than the one yielded by the mixture of Gaussians. This is mostly because we use motion cues to identify background blocks and because the MAP background model integrates local and global consistency. The mixture of Gaussians approach, on the other hand, uses only pixel-wise intensity information, so its performance depends critically on the variations of the intensity distribution of the background scene.
Initialization and tracking
To further illustrate the efficiency of using our proposed algorithm to estimate a background model for tracking, we show the estimations of the initial background models of test sequences C and D, and some subsequent tracking results, in Fig. 2.11.
Below each depicted image frame $I_t$, the corresponding background model $\tilde{B}^{*}_{t}$ is plotted. In the two experiments, the estimations of the initial background model $\tilde{B}^{*}_{t^{*}}$ are completed at frame numbers t∗ = 470 and 243, respectively. Once the $\tilde{B}^{*}_{t^{*}}$
Figure 2.10: Tests on lighting variations with sequence B. (a)–(d) Frames B008, B165, B260, and B396. Due to overcast clouds, the outdoor lighting over the road changes significantly throughout the sequence. As a result, the quality of the background models yielded by the mixture of Gaussians is considerably affected, whereas our formulation is more robust to such lighting perturbations. (e)–(f) The two background models derived at t∗ = 475 by (e) the mixture of Gaussians and (f) our MAP formulation, enlarged and enhanced in contrast. False textures and extra noise can be observed in the road areas of (e).
described in [6]. We also note that, as demonstrated in Figs. 2.11 (j)–(l), the background model can be updated appropriately during tracking, even when significant changes in the scene background have occurred.
Figure 2.11: Background model initialization and tracking for sequence C, (a) C050, (b) C295, (c) C470 (t∗ = 470), (d) C501, (e) C535, (f) C549, and for sequence D, (g) D055, (h) D146, (i) D243 (t∗ = 243), (j) D338, (k) D373, (l) D610. The current image frame $I_t$ and the derived background model $\tilde{B}^{*}_{t}$ are plotted together, top and bottom, respectively. Some tracking results are also shown in the frames $I_t$.