Chapter 4. Implementation Issues and Experimental Results
4.2 Build Background
In a real surveillance system, the background image must be modeled. Even if the camera is static, the first acquired image cannot be used as the background because it may contain moving objects. Instead, the background is obtained by analyzing successive images over a period of time. This process is called background modeling.
Elgammal et al. [18] proposed modeling the foreground and background with nonparametric kernel density functions. Although the Gaussian form of the density function can be implemented with a lookup table, the method in [18] needs many sample frames to make the model reliable, so both the storage and the computation time grow with the number of samples. This section introduces a simpler background modeling procedure.
4.2.1. Motion Detection without Background
When no background is available, the moving parts of a sequence can be extracted from the inter-frame difference. The boundaries of the moving objects are identified by thresholding the absolute difference.
Manipulating the image data pixel by pixel is very sensitive to noise. Hence, the inter-frame difference image is divided into many blocks and the sum of the absolute difference (SAD) within each block is used to identify the variation property. If the SAD of the block is above a predefined threshold, the block is classified as a motion block; otherwise, the block may be located on the background or the interior region of a moving object.
Results for different threshold values are shown in Figure 4-9. The value d represents the average absolute difference per pixel in the block. If the SAD of a block exceeds d × (block size)^2, the block is marked as a motion block. The black pixels in Figure 4-9 indicate the motion blocks. The block size is 8×8. When d = 0, every block in the image is detected as a motion block. As d increases, the number of detected blocks decreases. Based on these experiments, d = 5 is chosen empirically.
For example, with the 8×8 block size and d = 5, a block is classified as a motion block if SAD > 5 × 8^2 = 320.
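The block-based test described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it assumes grayscale frames stored as lists of rows of integers, and the function names `block_sad` and `motion_blocks` are hypothetical.

```python
def block_sad(prev, curr, n=8):
    """Return a 2-D list holding the SAD of each n x n block.

    prev and curr are grayscale frames (lists of rows of ints) of equal size.
    Pixels beyond the last full block are ignored.
    """
    rows, cols = len(curr) // n, len(curr[0]) // n
    sad = [[0] * cols for _ in range(rows)]
    for y in range(rows * n):
        for x in range(cols * n):
            sad[y // n][x // n] += abs(curr[y][x] - prev[y][x])
    return sad


def motion_blocks(prev, curr, d=5, n=8):
    """Mark a block True when its SAD exceeds d * n**2 (320 for d=5, n=8)."""
    threshold = d * n * n
    return [[s > threshold for s in row] for row in block_sad(prev, curr, n)]


# Toy 16 x 16 frames: only the top-left 8 x 8 block changes, by 6 per pixel.
prev = [[0] * 16 for _ in range(16)]
curr = [row[:] for row in prev]
for y in range(8):
    for x in range(8):
        curr[y][x] = 6
print(motion_blocks(prev, curr))  # [[True, False], [False, False]]
```

In the toy frames the changed block has SAD = 6 × 64 = 384, which is above the threshold of 320, so only that block is flagged.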
[Figure 4-9 panels, top row: d = 0, d = 2, d = 3, d = 5; bottom row: d = 7, d = 11, d = 15, d = 19]
Figure 4-9 Motion detection by thresholding the inter-frame difference image. The black pixels represent the motion blocks detected for each value of d.
The size of block affects the storage requirement of the detection result. A smaller block size produces more blocks and the required storage increases. A smaller block is also more sensitive to noise. However, a smaller block size provides a better precision for the positioning of moving objects.
For d = 5 in Figure 4-9, only the boundary of the moving object is detected. A more solid result is needed; otherwise, the interior of the moving object may be absorbed into the background region. The level surface evolution from level set theory may help classify these cracked points. The level surface is updated by the following equation:
$$\frac{\partial \phi}{\partial t} = u\left(d \times (\text{motion block})^2 - SAD_{i,j}\right)\,|\nabla \phi|, \qquad \text{(Eq. 4-11)}$$
where u(x) is the unit step function and
$$SAD_{i,j} = \sum_{p=1}^{N} \sum_{q=1}^{N} \left| I_t\big((i-1)N + p,\ (j-1)N + q\big) - I_{t-1}\big((i-1)N + p,\ (j-1)N + q\big) \right|. \qquad \text{(Eq. 4-12)}$$
Here, N is the block size. The level surface moves upward if the SAD is less than d × (motion block)^2.
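The filling effect of this update can be illustrated with a much simpler discrete stand-in. The sketch below is not a level set solver: it swaps in a breadth-first flood fill and assumes that the front starts at the image border and advances only through blocks whose SAD is below the threshold (where the unit step equals 1). Blocks the front never reaches are treated as part of the moving object. The name `fill_interior` is hypothetical, and unlike a tuned evolution with a limited number of iterations, this fill leaks through any gap in the boundary.

```python
from collections import deque


def fill_interior(motion):
    """Return a mask that is True for motion blocks and for every
    non-motion block that cannot be reached from the image border."""
    rows, cols = len(motion), len(motion[0])
    reached = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the front with every non-motion block on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and not motion[r][c]:
                reached[r][c] = True
                queue.append((r, c))
    # Advance the front through 4-connected non-motion blocks.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not motion[nr][nc] and not reached[nr][nc]:
                reached[nr][nc] = True
                queue.append((nr, nc))
    return [[not reached[r][c] for c in range(cols)] for r in range(rows)]


# A closed ring of motion blocks with an undetected block in the middle.
ring = [[False] * 5 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 4):
        ring[r][c] = (r, c) != (2, 2)
filled = fill_interior(ring)
print(filled[2][2])  # True: the hole is now part of the object
```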
Dilation and erosion can also be used to link the boundary of the moving objects. In Figure 4-10 (b) and Figure 4-10 (c), 3x3 and 5x5 structuring elements are applied, respectively. The result with the 3x3 structuring element still contains discontinuities on the boundary of the moving object. The result with the 5x5 structuring element is better. Figure 4-10 (d) is the result of applying (Eq. 4-11). The contour propagation is shown in Figure 4-11.
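For comparison, the morphological closing used for Figure 4-10 (b) and (c) can be sketched as a dilation followed by an erosion on the binary block mask. This is an illustrative pure-Python version under stated assumptions: a square k x k structuring element with odd k, and everything outside the image treated as background. The helper names `_window_op` and `closing` are hypothetical.

```python
def _window_op(mask, k, op):
    """Apply any (dilation) or all (erosion) over each k x k window.
    Positions outside the mask count as background (False)."""
    rows, cols, h = len(mask), len(mask[0]), k // 2

    def at(y, x):
        return 0 <= y < rows and 0 <= x < cols and mask[y][x]

    return [[op(at(y, x)
                for y in range(r - h, r + h + 1)
                for x in range(c - h, c + h + 1))
             for c in range(cols)]
            for r in range(rows)]


def closing(mask, k):
    """Binary closing: dilate, then erode, with a k x k square element."""
    h = k // 2
    rows, cols = len(mask), len(mask[0])
    # Pad with background so the dilation is not clipped at the border.
    blank = [False] * (cols + 2 * h)
    padded = ([blank[:] for _ in range(h)]
              + [[False] * h + list(row) + [False] * h for row in mask]
              + [blank[:] for _ in range(h)])
    eroded = _window_op(_window_op(padded, k, any), k, all)
    return [row[h:h + cols] for row in eroded[h:h + rows]]


# A boundary with a three-block gap.
gap = [[True, False, False, False, True]]
print(closing(gap, 3))  # [[True, False, False, False, True]]
print(closing(gap, 5))  # [[True, True, True, True, True]]
```

The toy example shows the same tendency as the figure: the 3x3 element leaves the three-block gap open, while the 5x5 element bridges it.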
[Figure 4-10 panels: (a) Delete the isolated points; (b) Closing with 3x3 structuring element; (c) Closing with 5x5 structuring element; (d) Use (Eq. 4-11)]
Figure 4-10 Producing a solid detection result. (a) The blue blocks show the original SAD detection result. (b) Morphological closing with a 3x3 structuring element. (c) Morphological closing with a 5x5 structuring element. (d) The hole inside the moving object filled via (Eq. 4-11).
[Figure 4-11 panels: Iterations 1, 4, 7, 10, 13, 16, 19, 22]
Figure 4-11 Contour propagation using (Eq. 4-11).
4.2.2. Background Modeling
Once the motion regions are extracted, the background can be built by accumulating the static image data. The Video Surveillance and Monitoring (VSAM) project [19] at Carnegie Mellon University proposed a layered detection method. Every pixel in the image has one of three states: transient, static, or background. Pixels with a high inter-frame difference value are defined as transient pixels; otherwise, they are defined as static. Static pixels become background pixels if they keep their “static” state for a long enough time. The relationship between these three states is defined graphically. Moreover, the motion regions detected at different times are stored in different layered maps, so this method can deal with the occlusion problem.
If the initial background information is not available, it has to be extracted from sequences that contain moving objects. A preset background model is not adequate in a practical surveillance system because the background may change over time. Two types of blocks are defined using the graph concept in [19]: the “background” block and the “motion” block. The “static” block is not necessary here because the static region inside the moving objects is included in the motion region by the method described in Section 4.2.1. If a block is classified as a non-motion block in n successive frames, its averaged RGB values are defined as background. If the block remains static for fewer than n frames, the counter is reset and the accumulated values in the buffer are cleared.
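The counting rule above can be sketched as a small per-block accumulator. This is a simplified illustration rather than the thesis code: it assumes each block is summarized by one mean RGB triple per frame (the text does not say whether averaging is per pixel or per block), and it assumes a block stops updating once its background has been built. The class name `BlockBackground` and its methods are hypothetical.

```python
class BlockBackground:
    """Per-block accumulator: n consecutive non-motion frames give a background."""

    def __init__(self, rows, cols, n=30):
        self.n = n
        self.count = [[0] * cols for _ in range(rows)]  # accumulation map
        self.total = [[[0.0, 0.0, 0.0] for _ in range(cols)] for _ in range(rows)]
        self.background = [[None] * cols for _ in range(rows)]

    def update(self, motion, rgb):
        """motion[r][c] is the motion flag; rgb[r][c] is the block's (R, G, B)."""
        for r, row in enumerate(motion):
            for c, is_motion in enumerate(row):
                if self.count[r][c] == self.n:
                    continue  # background already built for this block
                if is_motion:
                    # Reset the counter and clear the accumulated values.
                    self.count[r][c] = 0
                    self.total[r][c] = [0.0, 0.0, 0.0]
                    continue
                self.count[r][c] += 1
                for k in range(3):
                    self.total[r][c][k] += rgb[r][c][k]
                if self.count[r][c] == self.n:
                    self.background[r][c] = tuple(
                        v / self.n for v in self.total[r][c])

    def complete(self):
        """True once every entry of the accumulation map equals n."""
        return all(cnt == self.n for row in self.count for cnt in row)


# Two blocks, n = 3; the second block is in motion only at frame 2.
model = BlockBackground(1, 2, n=3)
rgb = [[(10, 20, 30), (40, 50, 60)]]
for frame in range(1, 6):
    model.update([[False, frame == 2]], rgb)
print(model.complete())  # True
```

In the usage example the first block is finished after frame 3, while the second block is reset at frame 2 and therefore needs frames 3 to 5 before its background is accepted.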
The test sequence is shown in Figure 4-12. The accumulation map is shown in Figure 4-13. The motion region in the map is black because the accumulation number is reset to zero. When the person leaves their original position, the static background information previously occupied by the person is accumulated. Eventually, every value in the accumulation map reaches n and the background modeling is complete. In this case, n is set to 30. The background is built after 90 frames, or about 3 seconds. The background modeling time strongly depends on the displacement of the moving object. If the object keeps moving in the same position, the background cannot be built until the object leaves that position.
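The timing quoted above can be checked with a short calculation. The frame rate r = 30 frames per second is an assumption inferred from "90 frames or about 3 seconds", and k_max is a hypothetical symbol for the last frame at which any block was still flagged as motion. The one-frame offset from the first difference image is ignored.

```latex
% Elapsed time for F frames at an assumed rate r = 30 frames/s
t = \frac{F}{r} = \frac{90}{30} = 3~\text{s}

% A block last flagged as motion at frame k needs frames k+1, ..., k+n
% without motion, so its background is accepted at frame k + n.
F_{\text{done}} = k_{\max} + n
\quad\Rightarrow\quad
k_{\max} = 90 - 30 = 60
```

Under these assumptions, the last block to be vacated in the test sequence was still in motion at about frame 60.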
[Figure 4-12 panels: Frames 1, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Figure 4-12 Test sequences for background modeling.
[Figure 4-13 panels: Frames 1, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Figure 4-13 Background modeling process with n = 30. The white regions are the regions already accepted as background.
When n increases, the required number of frames increases. Figure 4-14 (a) and (b) show the results for n = 30 and n = 60, respectively. The discontinuous regions in the pink ellipses are due to the different intensity values from different frames. These discontinuities can be tolerated because the subtraction data will be classified by the “active contours without edges” model. A larger n costs extra time and does not necessarily produce a better result.
[Figure 4-14 panels: (a) n = 30 needs 91 frames (3.03 seconds); (b) n = 60 needs 580 frames (19.33 seconds)]
Figure 4-14 Background modeling using different n’s. (a) n = 30, (b) n = 60. The discontinuous regions in the pink ellipses are due to the different intensity values from different frames.