
Chapter 1 Introduction

1.3 Organization

The rest of the paper is organized as follows. Chapter 2 presents video object segmentation and our proposed shadow elimination algorithm. Chapter 3 presents the method of video object tracking, including two matching algorithms. Chapter 4 presents the method for analyzing abnormal human behavior. Chapter 5 shows the architecture of the system and the experimental results. Chapter 6 concludes the paper.

Chapter 2

Object Segmentation and Features Extraction

The first module in the proposed surveillance system is video object segmentation. The task of video object segmentation is to find a mask indicating the shape and the position of the moving objects. In this chapter, we introduce the related work of video object segmentation in Section 2.1. In Section 2.2, we present the proposed method of video object segmentation including the shadow elimination. In Section 2.3, we introduce the features extracted in our system.

2.1 Related Work of Video Object Segmentation

Segmentation algorithms can be classified into two categories: the homogeneity based methods and the change detection based methods. The homogeneity based algorithms [1-4] segment moving objects based on the homogeneity of their color, texture or motion information. Pixels or blocks with similar features are first grouped into small regions, and these regions are then grouped into objects using other features. Although algorithms of this kind [1-4] can provide precise object masks, the watershed algorithm used for the boundary decision is computationally expensive. The motion estimation needed to compute precise motion vectors for clustering small regions also takes a lot of time. Thus, this kind of algorithm is not a good choice for a real-time system.

The other category of segmentation algorithms is the change detection based algorithms. Algorithms of this kind [5-7] segment objects by taking the difference between the current input frame and a reference image; a threshold is then chosen to decide a difference mask indicating the shape and the position of the moving objects. Traditionally, the previous input frame is chosen as the reference image. However, this choice has some well-known drawbacks [8].

First, when the speed of the moving object is not consistent, it becomes impossible to indicate its position using the difference image, and thus misses or false alarms in segmentation are unavoidable. Second, uncovered background is another problem in traditional change-detection algorithms, because background regions that were covered by objects in the previous frame and are now uncovered may be considered as changed. Although uncovered background can be detected and removed when the motion information is taken into consideration, motion estimation is computationally expensive and greatly lowers the efficiency of the change-detection algorithms.

Recently, some change detection algorithms [8-13, 19] use a reference background image to segment moving objects. The reference background image, which contains the still background without any objects, is either acquired beforehand or updated dynamically. The change-detection algorithms with registered background effectively solve the problems of uncovered background and inconsistent object speed. In addition, they are efficient and can meet the real-time requirement.

For the change-detection based algorithms, shadows are always extracted together with objects [12-14]. The main idea of shadow removal is to classify the object pixels into real moving objects or shadows. Because of the real-time requirement, a simple but efficient shadow removal algorithm is needed. Based on the shadow elimination method in [12], we develop a new shadow elimination algorithm, which is presented in Section 2.2.5. In our proposed system, we adopt the change-detection based algorithm with registered background to segment moving objects.

2.2 Video Object Segmentation

In our object-based tracking and human behavior analysis system, the first step is to segment the moving objects as precisely as possible. The object segmentation algorithm takes the raw video data directly as input, segments the moving objects in the surveillance video sequences and extracts the object masks that indicate the presence of the moving objects. In surveillance video, the position of the camera is fixed and the background is stationary, so the simplest way to segment moving objects is to use the change detection based method. When comparing a frame to a background image, it is straightforward to consider the regions that change significantly as moving objects. Therefore, selecting the background image as the reference for the change detection based algorithm can effectively achieve our goal. However, this simple method may fail when the background changes. For this reason, we use a more complex method that updates the background before extracting moving objects.

However, besides the moving objects that we are interested in, there are other types of changing regions that may be misclassified during the segmentation process. We classify these misclassified regions into two types. The first type is camera noise, which is the white noise of the camera and is usually small. The second type is the 'ghost', a changing region that appears and then disappears quickly without steady motion and is usually bigger than the camera noise. The ghost effect usually results from the waving of tree leaves and regional lighting effects. In order to obtain accurate object masks, these unwanted changing regions should be filtered out.

In our segmentation algorithm, as shown in Fig.2, we use the change detection based algorithm with background registration and use a size filter to remove the noises and ghosts.

Fig. 2. Segmentation process diagram

2.2.1 Inter-frame Differencing

In the segmentation algorithm, we first compute the inter-frame difference image (DI) between the current input frame and the previous frame. The difference value in the DI shows how strongly a pixel changes between two consecutive frames and how likely the pixel is to be considered as changing. Because the human eye is more sensitive to luminance than to chrominance, we take the difference on the luminance channel only. After applying the threshold THd to the difference image, we obtain a difference mask (DM) that indicates the possible changing regions between two consecutive frames. The computations of DI and DM are shown in Eq.(1) and Eq.(2), where IY(i, j, t) and IY(i, j, t-1) denote the luminance value of the pixel at (i, j) in the current frame and the previous frame, respectively. The DM is used to update the background image in the next step.
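The two steps of this section can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: it assumes that DI is the absolute luminance difference and that a pixel is marked as changing when DI reaches THd (the direction and strictness of the comparison are assumptions). Frames are plain nested lists of luminance values, and the function names are hypothetical.

```python
def difference_image(curr_y, prev_y):
    """Per-pixel absolute luminance difference between two frames (DI)."""
    return [
        [abs(c - p) for c, p in zip(curr_row, prev_row)]
        for curr_row, prev_row in zip(curr_y, prev_y)
    ]


def difference_mask(di, th_d):
    """Binary mask (DM): 1 where the difference reaches th_d, else 0."""
    return [[1 if value >= th_d else 0 for value in row] for row in di]


# Two tiny 2x3 luminance frames; two pixels change noticeably.
prev_y = [[100, 100, 100],
          [100, 100, 100]]
curr_y = [[100, 130, 100],
          [98, 100, 60]]

di = difference_image(curr_y, prev_y)   # [[0, 30, 0], [2, 0, 40]]
dm = difference_mask(di, th_d=20)       # [[0, 1, 0], [0, 0, 1]]
```

The small change of 2 in the lower-left pixel stays below the threshold and is treated as unchanged, which is the behavior wanted for camera noise.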

2.2.2 Dynamic Threshold Decision

In order to make the segmentation adapt to various kinds of environments and video contents, the threshold THd for deciding the DM cannot be fixed and should be selected adaptively. In many studies [8, 10, 43, 44], the values in the difference image are modeled by a mixture of two distributions. Thus, finding the threshold corresponds to finding the two distribution functions that approximate the histogram of the difference values. Traditionally, the valley between the two peaks is found and chosen as the threshold dividing the two distributions. However, in a real case, as shown in Fig.3, the histogram fluctuates heavily and it is difficult to find a threshold just by finding a valley. In [43], Wu et al. suggest that the histogram can be converted to a monotonically increasing histogram by accumulating the original histogram values, as shown in Fig.4. In the cumulative histogram, the problem of finding a threshold is simplified to finding an intermediate point such that the two straight lines interpolated through the start point, the intermediate point and the end point best approximate the cumulative histogram. Instead of using the ratio histogram in [43], we simply use the difference, since the ratio histogram is much more expensive to compute.

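$$DI(i, j, t) = \left| I_Y(i, j, t) - I_Y(i, j, t-1) \right| \tag{1}$$

$$DM(i, j, t) = \begin{cases} 1, & DI(i, j, t) \ge TH_d \\ 0, & \text{otherwise} \end{cases} \tag{2}$$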

Fig. 3. The histogram of a difference image

Fig. 4. The cumulative histogram and the interpolated lines
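The intermediate-point search described in this section can be sketched as follows. This is only one possible reading: the error measure (sum of absolute differences between the cumulative histogram and the two interpolated lines) and the function name `two_line_threshold` are assumptions made for illustration.

```python
from itertools import accumulate


def two_line_threshold(hist):
    """Pick the bin whose two-segment fit best matches the cumulative histogram."""
    cum = list(accumulate(hist))
    last = len(cum) - 1
    if last < 2:
        raise ValueError("need at least three bins")

    def segment_error(x0, x1):
        # Sum of absolute differences between the cumulative histogram and
        # the straight line joining (x0, cum[x0]) and (x1, cum[x1]).
        slope = (cum[x1] - cum[x0]) / (x1 - x0)
        return sum(abs(cum[x] - (cum[x0] + slope * (x - x0)))
                   for x in range(x0, x1 + 1))

    # Try every intermediate point between the start point and the end point.
    return min(range(1, last),
               key=lambda k: segment_error(0, k) + segment_error(k, last))


# Difference values 0..3 occur often (noise); larger values do not occur.
th_d = two_line_threshold([10, 10, 10, 10, 0, 0, 0, 0])  # 3
```

In this toy histogram the cumulative curve rises linearly up to bin 3 and is flat afterwards, so the two lines meet exactly at the knee. The exhaustive search is quadratic in the number of bins, which is small (256 for 8-bit differences).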

2.2.3 Background Buffer Update

The next step in the segmentation algorithm is to update the background image (BI). In the background registration method [8-13, 19, 44], the quality of the segmentation result depends heavily on the correctness of the background, so we need a robust method to retrieve and maintain the background image. The simplest way to obtain the background image is to capture the background beforehand.

However, the background image may change slightly and gradually because the luminance may vary with time. In our algorithm, we dynamically update the background buffer using the stationary index (SI). The stationary index records how likely a pixel is to belong to the background region. In Eq.(2), a pixel with DM = 0 can be considered a background pixel and a pixel with DM = 1 can be considered a moving pixel. Hence, the relationship between DM and SI can be stated as in Eq.(3). If a pixel does not move for many consecutive frames, it is likely to be a background pixel. Therefore, we can use the SI value to update the background image (BI). The background update method is stated in Eq.(4), where BIC(i, j, t) denotes the value of color channel C of the background image at position (i, j) at time t, and IC(i, j, t) denotes the value of color channel C of the video image at position (i, j) at time t. The pre-decided value thm means that a pixel that has not moved for thm consecutive frames is considered to be a background pixel.

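$$SI(i, j, t) = \begin{cases} SI(i, j, t-1) + 1, & DM(i, j, t) = 0 \\ 0, & DM(i, j, t) = 1 \end{cases} \tag{3}$$

$$BI_C(i, j, t) = \begin{cases} I_C(i, j, t), & SI(i, j, t) \ge th_m \\ BI_C(i, j, t-1), & \text{otherwise} \end{cases} \tag{4}$$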

When the system starts up, the first video frame is set as the background buffer to speed up the convergence of the background image. After this initial step, we update the background buffer whenever SI is greater than or equal to thm. A gradual variation is thus absorbed into the background buffer within thm frames. Even if there is a sudden lighting variation, for example when clouds disperse and the sun is revealed, the update equation can catch up with the variation within thm frames. Hence, the threshold thm determines the updating rate of the background buffer and should be chosen according to the environment.
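A minimal sketch of this update rule is given below. It assumes that SI counts consecutive frames in which DM is 0 and resets to 0 on motion, and that a pixel of the current frame is copied into the background once SI reaches thm. A single channel is used for brevity; the function name and the in-place update are illustrative choices.

```python
def update_background(background, frame, dm, si, th_m):
    """Update the stationary index and the background buffer in place."""
    for i, row in enumerate(dm):
        for j, moving in enumerate(row):
            # Count consecutive still frames; any motion resets the count.
            si[i][j] = 0 if moving else si[i][j] + 1
            if si[i][j] >= th_m:
                background[i][j] = frame[i][j]


# The first frame initialises the background buffer.
background = [[10, 10]]
si = [[0, 0]]

# The left pixel brightens slowly and stays below the difference threshold
# (DM = 0); the right pixel is covered by a moving object (DM = 1).
update_background(background, [[12, 50]], [[0, 1]], si, th_m=2)
update_background(background, [[12, 60]], [[0, 1]], si, th_m=2)
# background is now [[12, 10]] and si is [[2, 0]]
```

After two still frames the left pixel is refreshed, while the pixel under the moving object keeps its old background value.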

2.2.4 Background Differencing

After we obtain a background buffer, we can segment the moving objects from the background. Unlike the computation of the difference image, we use the luminance and chrominance channels together instead of luminance only, because some moving objects differ clearly from the background in color while the difference between the current frame and the background is almost zero once the chrominance information is discarded. In order to extract accurate object masks and not to miss any important moving objects, the chrominance information must also be considered. Because the importance of the chrominance channels depends on the intensity of the luminance channel, we design a difference score function to evaluate the difference in the YUV color space. The difference score, denoted as DS, is computed using Eq.(5) through Eq.(7). For a pixel (i, j), we first take the stronger luminance intensity of the current frame and the background image. Then, we decide the weighting factor of the chrominance based on this luminance intensity. Because the valid range of the luminance channel after conversion is from 16 to 235, we can subtract 16 from the luminance value and divide the result into 11 levels, from 0.0 to 1.0. After that, we use the weighted sum in Eq.(7) to compute the color distance in the YUV color space.
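The three steps of the difference score can be illustrated with the following sketch. The exact forms of Eq.(5) through Eq.(7) are not reproduced here, so both the rounding used to obtain the 11 levels and the form of the weighted sum (luminance difference plus weighted chrominance differences) are assumptions, as are the function names.

```python
def chroma_weight(y_max):
    """Map a luminance in [16, 235] to one of 11 levels 0.0, 0.1, ..., 1.0."""
    y = min(max(y_max, 16), 235)          # clamp to the valid luminance range
    return round(10 * (y - 16) / 219) / 10


def difference_score(pixel, bg_pixel):
    """Luminance difference plus luminance-weighted chrominance differences."""
    (y, u, v), (by, bu, bv) = pixel, bg_pixel
    w = chroma_weight(max(y, by))         # stronger of the two luminances
    return abs(y - by) + w * (abs(u - bu) + abs(v - bv))


# Same luminance, different color: the chrominance terms carry the score.
bright = difference_score((235, 100, 140), (235, 130, 120))  # 50.0
# In very dark regions chrominance is unreliable and is weighted out.
dark = difference_score((16, 100, 140), (16, 130, 120))      # 0.0
```

The first call shows the case that motivates this section: a luminance-only difference would be zero, yet the object is clearly distinguishable by its color.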


Sometimes an object enters the frame and then stays at the same position on the xy-plane. We call such an object a 'stopped object', since its motion in both the x and y directions is almost zero. Because the algorithm copies unchanged regions of the input frame into the background, the color of a stopped object will be written into the background buffer if it stays still for too long. In this case, false alarms will occur in the object regions because the background has been incorrectly updated.

Although this problem can be solved by lengthening the interval of background updating, the time needed to adapt to luminance variations is also lengthened. Thus there is a tradeoff, and both cases must be taken into consideration.

After we finish computing the background difference image using the difference score function, another dynamically selected threshold THb is applied to get the background difference mask (BDM), as shown in Eq.(9). The background difference mask extracted here indicates the moving object regions relative to the reference background image. However, the shadow is always included in the background difference masks. Thus, further filtering of shadow is required.

2.2.5 Shadow Elimination

Shadows are always extracted together with objects in change detection based algorithms. In order to extract a more precise object, we need to remove the shadow from the object mask. The main idea of shadow removal is to classify the object pixels into real moving objects or shadows [12-14]. Based on the mathematical analysis model in [12], we develop a new shadow elimination algorithm. Before introducing it, we need to explain the principle of color reflection, which can be modeled as the multiplication of the light energy and the reflectance of an object. This principle can be expressed as the following equation.

For frame objects, we use the change detection based algorithm to obtain moving and static objects. However, shadows are always extracted together with real moving objects. Thus, we can classify the objects in the current frame into three classes: real moving objects, shadows of moving objects and static background objects. Since the color of a real moving object is unknown, we do not analyze it. We use Eq.(10) to model the background objects and the shadows in the current frame, as shown in Eq.(11) and Eq.(12), respectively. The subscripts B and S distinguish background objects and shadows. From Eq.(4), we generate a background image. Thus, we can also use Eq.(10) to model the background image, as in Eq.(13).

Now, we consider the relationship among Eq.(11), (12) and (13). We can assume that the background object is stationary, so the reflectance of the background object and of the shadow in the current frame is equal to the reflectance of the object in the background frame. Moreover, if the lighting does not change, the light energy of the background object in the current frame is equal to the light energy of the object in the background frame. Since the light reaching the shadow region is blocked by the moving object, the light energy of the shadow is decreased. Therefore, the light energy of the background object and that of the shadow differ by a value K. Thus, we have the following relationship.

(15)

Based on the observation from the Eq.(11) to Eq.(15), if we substitute Eq.(14), (15) into Eq.(11), (12), (13), we have,

From Eq.(17), we know that if a pixel belongs to a shadow, the value of Eq.(17) will be lower than 1. According to Eq.(16) and Eq.(17), we develop the shadow elimination algorithm shown in Fig.5 to remove shadows, where th1 and th2 are the predecided thresholds.

Input: The pre-extracted object masks.

Output: Object masks without shadow.

Procedure:

1. For each pixel (i, j) of the object mask at time t;

2. Do

i. If th1 ≤ BIC(i, j, t)/IC(i, j, t) ≤ 1 + th2, then remove pixel (i, j) from the current object mask

ii. Else do nothing

Fig. 5. Shadow Elimination Algorithm

After shadow elimination, the shadow is removed from the object mask. However, the object mask still contains a lot of noise and the object boundaries are not smooth. Thus, further filtering of noise is required.
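The procedure of Fig. 5 can be sketched as below. The sketch assumes that the ratio test is an inclusive range check (th1 ≤ BIC/IC ≤ 1 + th2), works on a single channel, and skips pixels whose current value is zero to avoid division by zero. The threshold values in the example are purely illustrative and are not taken from the experiments.

```python
def remove_shadow(mask, frame, background, th1, th2):
    """Return a copy of mask with pixels that pass the ratio test cleared."""
    out = [row[:] for row in mask]
    for i, row in enumerate(mask):
        for j, is_object in enumerate(row):
            if not is_object or frame[i][j] == 0:
                continue  # not an object pixel, or the ratio is undefined
            ratio = background[i][j] / frame[i][j]
            if th1 <= ratio <= 1 + th2:
                out[i][j] = 0  # classified as shadow
    return out


background = [[200, 200, 200]]
frame = [[150, 40, 200]]      # mildly darkened, strongly changed, unchanged
mask = [[1, 1, 0]]

clean = remove_shadow(mask, frame, background, th1=1.0, th2=0.5)
# clean == [[0, 1, 0]]
```

The mildly darkened pixel (ratio about 1.33) is removed as shadow, while the strongly changed pixel (ratio 5) is kept as part of the moving object.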

2.2.6 Morphological Operation

To smooth the object boundaries and remove the noise, two kinds of morphological operations are frequently used [8, 17]. The closing operation is first used to fill the black holes inside the object masks, and the opening operation is then used to remove the small noise regions that do not belong to a moving object. In our algorithm, structuring elements of size 7x7 and 5x5 are selected for the closing and opening operations, respectively. In most cases, the smaller camera noise can be successfully filtered. However, larger regions caused by the ghost effect are hard to remove. Although a larger structuring element may help, its computation cost is too high. Thus, instead of using a larger structuring element, we filter out these ghost regions with temporal filtering in the video object tracking algorithm.

After the morphological operations, the object mask is smoothed and indicates the shapes and the positions of all the moving objects in the current frame. The individual objects in the object mask are then extracted in the next process.
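The closing-then-opening sequence can be illustrated with a small pure-Python sketch. It assumes square structuring elements and treats pixels outside the image as background; the helper names are hypothetical. The example uses 3x3 elements on a tiny mask so that the effect is easy to follow, whereas the text above uses 7x7 and 5x5.

```python
def _window_test(mask, size, combine):
    """Apply any() or all() over a size x size window around every pixel."""
    h, w, r = len(mask), len(mask[0]), size // 2

    def at(y, x):
        # Pixels outside the image count as background (0).
        return mask[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [
        [int(combine(at(y, x)
                     for y in range(i - r, i + r + 1)
                     for x in range(j - r, j + r + 1)))
         for j in range(w)]
        for i in range(h)
    ]


def dilate(mask, size):
    return _window_test(mask, size, any)


def erode(mask, size):
    return _window_test(mask, size, all)


def smooth_mask(mask, close_size=7, open_size=5):
    """Closing (dilate, erode) fills holes; opening (erode, dilate) drops specks."""
    closed = erode(dilate(mask, close_size), close_size)
    return dilate(erode(closed, open_size), open_size)


# A 5x5 object with a one-pixel hole, plus an isolated noise pixel.
solid = [[1 if 1 <= i <= 5 and 1 <= j <= 5 else 0 for j in range(11)]
         for i in range(7)]
noisy = [row[:] for row in solid]
noisy[3][3] = 0   # hole inside the object
noisy[3][9] = 1   # camera noise

smoothed = smooth_mask(noisy, close_size=3, open_size=3)
```

Here `smoothed` equals `solid`: the closing fills the hole and the opening removes the isolated pixel. With zero padding, objects touching the image border are slightly eroded by the closing, which is one reason a production system might choose a different border policy.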

2.2.7 Connected Component Labeling Algorithm and Size Filtering

Since the object mask simply indicates the positions and the shapes of all the moving regions without separating them, each individual object in the object mask must be extracted and assigned an identifying label. The connected component algorithm is frequently used [8] for this task. For every pixel, it first examines the neighboring pixels and assigns that pixel a label. Then, pixels with the same label or equivalent labels are clustered together to form an isolated object.

Because some large noise and ghost regions are hard to remove completely, size filtering must be performed after the labeling process. The size filtering process filters out the regions that are smaller than a pre-defined threshold. The objects that are not filtered out are called the objects-of-interest and are tracked in the video tracking module.
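Labeling and size filtering can be sketched as follows. For brevity the sketch uses a breadth-first flood fill rather than the neighbor-examination scheme with equivalent labels described above; both produce the same components. The use of 4-connectivity and the function names are assumptions.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected foreground regions; returns (labels, sizes)."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 1
    for start_i in range(h):
        for start_j in range(w):
            if not mask[start_i][start_j] or labels[start_i][start_j]:
                continue
            # Flood-fill a new region starting from this unlabeled pixel.
            labels[start_i][start_j] = next_label
            queue = deque([(start_i, start_j)])
            count = 0
            while queue:
                i, j = queue.popleft()
                count += 1
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if (0 <= ni < h and 0 <= nj < w
                            and mask[ni][nj] and not labels[ni][nj]):
                        labels[ni][nj] = next_label
                        queue.append((ni, nj))
            sizes[next_label] = count
            next_label += 1
    return labels, sizes


def size_filter(labels, sizes, min_size):
    """Clear every region that has fewer than min_size pixels."""
    keep = {label for label, n in sizes.items() if n >= min_size}
    return [[label if label in keep else 0 for label in row] for row in labels]


mask = [[1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
labels, sizes = label_components(mask)          # sizes == {1: 4, 2: 1}
objects = size_filter(labels, sizes, min_size=2)
```

The single-pixel region is discarded, and only the four-pixel object keeps its label for the tracking module.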

2.3 Video Object Feature Extraction

From the previous module, we obtain the current objects, which are the objects in the current frame. However, we do not yet know anything about the content of these objects. Thus, we need to extract features of the current objects. According to past research, many features, such as motion vector, texture, shape and color histogram, can be used to describe objects. In general, the more meaningful features are extracted, the more accurate the object description will be. Due to the real-time requirement, we can only choose a few features to describe objects. After strict screening, we require that the features describe (a) the position of the object, (b) the color of the object and (c) the shape of the object. For these reasons, we select the following four features: (i) center of object, (ii) color histogram, (iii) variance of object and (iv) major and minor axes of the best-fit-ellipse.

For the first feature, we use the center of the object to indicate its position. We calculate the center (Cx, Cy) of a moving object by using Eq.(18), where R is the region of the moving object and N is the number of pixels in the region R.
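As an illustration, the center can be computed as the mean pixel coordinate over the region. This assumes that Eq.(18) is the usual centroid; the representation of R as a list of (x, y) pairs and the function name are choices made for this sketch.

```python
def object_center(region):
    """Mean x and y coordinate of the pixels in region, a list of (x, y) pairs."""
    n = len(region)
    if n == 0:
        raise ValueError("region is empty")
    cx = sum(x for x, _ in region) / n
    cy = sum(y for _, y in region) / n
    return cx, cy


# A 2x2 block of pixels whose corners are (2, 1) and (3, 2).
center = object_center([(2, 1), (3, 1), (2, 2), (3, 2)])  # (2.5, 1.5)
```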

For the second feature, the color histogram, we divide each color channel into 16 bins instead of using full color. This decision is based on two reasons. The first reason is that full color information takes too much memory. For example, we would need 256*256*256*4 (Byte/object) = 64 MB/object to describe an object using the full color feature. The second reason is that a slight change in lighting changes the object color, which coarse bins tolerate better than full color. We use Eq.(19) to calculate the color bins of a moving object, where ri is the value of the ith red color bin, initialized to zero, and IR(i, j, t) is the red color value of the pixel (i, j) at time t. Similarly, we use the same method to calculate the green color bins (gi) and the blue color bins (bi).
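The binning and the memory estimate can be checked with a short sketch. It assumes 8-bit channel values and that a value falls into bin number value // 16, which is one natural reading of Eq.(19); the function name is hypothetical.

```python
def color_histogram(pixels, bins=16):
    """Count the (R, G, B) pixels of an object into per-channel histograms."""
    width = 256 // bins  # 16 intensity values per bin when bins == 16
    r, g, b = [0] * bins, [0] * bins, [0] * bins
    for red, green, blue in pixels:
        r[red // width] += 1
        g[green // width] += 1
        b[blue // width] += 1
    return r, g, b


r, g, b = color_histogram([(0, 128, 255), (15, 130, 250), (16, 0, 0)])
# r[0] == 2, r[1] == 1; g[8] == 2, g[0] == 1; b[15] == 2, b[0] == 1

# Memory comparison: a full-color histogram with 4-byte counters
# versus three 16-bin histograms.
full_color_bytes = 256 ** 3 * 4   # 67,108,864 bytes, i.e. 64 MB
binned_bytes = 3 * 16 * 4         # 192 bytes
```

The first two pixels differ slightly in every channel yet fall into the same bins, which shows how coarse binning absorbs small lighting changes.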

For the third feature of object variance, we use the x,y variance of an object to describe the spatial distribution of an object. The Eq.(20) is used to calculate the

Since the third feature does not describe shape, we select the fourth feature.

We use a best-fit-ellipse [15] to approximate the shape of the object instead of a bounding box, because the bounding box changes rapidly even when the posture of the object changes only slightly. For example, a comparison between the bounding box and the best-fit-ellipse is shown in Fig.6 and Fig.7. The increase in width of the bounding box is about 200% and much of the enclosed space does not belong to the object. In contrast, the increase in width of the best-fit-ellipse is about 150% and the non-belonging space is

