Chapter 1 Introduction
1.4 Thesis Organization
This thesis is organized as follows. Chapter 2 describes the complete moving object tracking system for dynamic backgrounds on an active PTZ camera, including the flow chart and a description of each part of the system. Chapter 3 introduces the moving object tracking algorithm, which comprises moving object detection, color space transformation, the mean-shift tracking algorithm, and the adaptive size tracking algorithm.
Chapter 4 describes the mechanism of an active PTZ camera and the control method used to track a moving object. The experimental results, covering both fixed-size and adaptive-size moving object tracking, are shown in Chapter 5. Finally, conclusions and future work are presented in Chapter 6.
Chapter 2 System Overview
The whole system contains two major parts: video sequence analysis and camera control. Both parts are implemented on a PC. We analyze the image data captured by the pan-tilt-zoom camera on the PC and then use the analysis result to control the camera for moving object tracking. The main purpose of the video sequence analysis is to decide which object to track and to estimate its position and size, so that a control decision can be made for the active pan-tilt-zoom camera. This part consists of two steps: change detection and mean-shift tracking. In the change detection step, we use the frame difference method to extract the object of interest. In the mean-shift tracking step, we use an adaptive size mean-shift method to track the moving object.
The main purpose of the camera control part is to accurately steer the active pan-tilt-zoom camera so that it follows the target selected by the video sequence analysis part. The control rule is to bring the object back to the image center as quickly as possible. Furthermore, the zoom in/out function is also controlled by the result of the video sequence analysis part.
Figure 2-1 shows the relationship between the video sequence analysis part and the camera control part, and Figure 2-2 shows the complete flow chart of the overall system.
Figure 2-1 : Relationship between video sequence analysis and camera control (blocks: Video Sequence Analysis, Camera Control; links: capturing video sequence data, sending position and size information)
Figure 2-2 : Whole system overview
2.1 System Architecture
2.1.1 Video Sequence Analysis
The main purpose of video sequence analysis is to extract a suitable moving target for non-rigid moving object tracking under a dynamic background.
It has two major elements, motion detection and mean-shift tracking, as shown in Figure 2-3.
Figure 2-3 : The flow chart of video sequence analysis
I. Motion detection:
This part searches for a moving object to track using the frame difference method.
The moving object with the largest difference area is chosen. If the selected moving object is large enough and has a reasonable size, the mean-shift tracking step is carried out. Figure 2-4 shows the flow chart of motion detection.
Figure 2-4 : The flow chart of motion detection
Motion detection has three major components: frame difference, a specific digital filter, and image projection. The frame difference method finds the area that differs between the previous frame and the current frame. The digital filter, built from the morphological operations dilation and erosion, then eliminates noise in the difference map. Finally, image projection delimits the moving object region.
II. Mean-Shift Tracking:
In this part, mean-shift tracking finds the most probable target position in the current frame by comparing its color information with that of the object in the previous frame. The histogram in RGB color space is typically adopted for this comparison. However, to reduce the sensitivity to illumination changes, we adopt the HSV color space, which separates brightness from color information.
This tracking approach needs a similarity measure between two distributions: the target color distribution model and the color distribution model of the current tracking candidate. A popular measure is the Bhattacharyya coefficient [25, 26]. If the measured similarity is over 80%, the tracking result is accepted; otherwise tracking is considered to have failed and a new detection process is started. The flow chart of the mean-shift tracking system is shown in Figure 2-5.
Figure 2-5 : The flow chart of mean-shift tracking system
2.1.2 Camera Control
The purpose of camera control is to drive the active camera so that the target object stays in the center of the monitor screen, especially when the moving object is about to drift away from the center. We therefore divide the screen into 9 regions (as shown in Figure 2-6). If the target is located in region E, the camera is stopped. Otherwise, if the target is located in one of the other regions (A, B, C, D, F, G, H, I), the camera is moved in the corresponding direction (as shown in Figure 2-7).
Figure 2-6 : Divide screen into 9 regions
A B C
D E F
G H I
Figure 2-7 : 8 control directions for each region
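The region rule can be illustrated with a minimal sketch. It assumes that the nine regions are equal thirds of the image and that each outer region maps to the pan/tilt direction pointing toward that region; neither detail is stated in the text, and the function names are hypothetical.

```python
# Region letters laid out as in Figure 2-6 (row 0 is the top of the image).
REGIONS = ("ABC", "DEF", "GHI")


def locate(x, y, width, height):
    """Return (row, col) of the 3x3 cell containing pixel (x, y).

    Assumption: the nine regions are equal thirds of the image.
    """
    col = min(int(3 * x / width), 2)
    row = min(int(3 * y / height), 2)
    return row, col


def region_of(x, y, width, height):
    """Return the region letter (A..I) for the target center (x, y)."""
    row, col = locate(x, y, width, height)
    return REGIONS[row][col]


def camera_command(x, y, width, height):
    """Return (pan, tilt): -1 = left/up, 0 = hold, +1 = right/down.

    Assumption: the camera moves toward the region that holds the target,
    so region E yields (0, 0), i.e. the camera stops.
    """
    row, col = locate(x, y, width, height)
    return col - 1, row - 1
```

For a 320x240 image, a target centered at (160, 120) falls in region E and produces the stop command (0, 0).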
Furthermore, the zoom-in and zoom-out functions are activated when the size of the moving object changes. For example, if the tracked size becomes 2 times larger than the original size, we activate the zoom-out action. Conversely, if the tracked size shrinks to 1/2 of the original size, we activate the zoom-in action. While the active camera is moving, the tracking system keeps running alongside the camera control system. So, if the moving object changes its direction, we can easily
2.2 Hardware Platform
In this system, we use an active camera, the Lilin PIH-7600 High Speed PTZ Dome shown in Figure 2-8, whose pan-tilt-zoom function lets us acquire real-time image frame sequences. These frames are captured and processed by a personal computer (PC) with an Intel Pentium(R) 4 at 2.4 GHz and 512 MB RAM, running Windows XP. As shown in Figure 2-9, the active camera has two interfaces: RS-232 and a video capture interface. The RS-232 interface carries the commands sent by the PC to control camera movement, including pan, tilt, and zoom operations. The video capture interface delivers the analog video sequence to a video capture card on the PC, a Vguard 7146, from which the image data is read. To process the images and send commands to the PTZ camera, we use Borland C++ Builder 6.0 as the development platform.
Figure 2-8 : Lilin PIH-7600 High Speed PTZ Dome
Figure 2-9 : Hardware platform diagram
Chapter 3 Deformable Dynamic Tracking System
In this chapter, we describe how to determine the location and area of a moving object and how to track it through the image sequence with the mean-shift algorithm. The tracking system is divided into two parts: moving object detection and mean-shift tracking. We use the frame difference method for moving object detection and a size-adaptive mean-shift method for tracking.
3.1 Moving Object Detection
The most commonly used moving object detection methods are the frame difference method and the background subtraction method. We use the frame difference method because the scene seen by an active camera changes constantly, so we cannot construct a stable background model for background subtraction. The moving object detection part consists of three elements: temporal difference, a specific filter, and image projection.
Temporal difference calculates the image difference between the previous and current frames, shown in Figure 3-1(a) and Figure 3-1(b). After the temporal difference step, noise is visible in the difference map (as shown in Figure 3-2). If we used this map directly to select the tracking region, tracking performance would suffer.
Figure 3-1 : (a) previous frame (left) (b) current frame (right)
Figure 3-2 : The result of frame difference in Figure 3-1
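The temporal difference step can be sketched as follows. This is a minimal illustration, assuming grayscale frames stored as lists of rows and a hypothetical threshold of 30 gray levels; the text does not state the threshold that was actually used.

```python
def frame_difference(prev, curr, threshold=30):
    """Binary motion map: 1 where |curr - prev| > threshold, else 0.

    prev, curr: grayscale frames of equal size (lists of rows, values 0..255).
    threshold:  assumed value, not taken from the text.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]
```

Small intensity changes stay below the threshold and are discarded, but isolated noisy pixels can still survive, which is why a filtering step follows.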
After we obtain the difference map, we apply a specific digital filter composed of image dilation and erosion to suppress the noise in the difference map and to enhance the region of the actually moving object. The result of applying this filter is shown in Figure 3-3, in which each pixel is classified as either moving or non-moving. Comparing Figure 3-2 with Figure 3-3 shows that the moving object area is much easier to recognize in the latter.
Figure 3-3 : Difference map after erosion and dilation
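A minimal sketch of such a morphological filter is given below. It assumes a 3x3 structuring element and applies erosion followed by dilation, the order given in the caption of Figure 3-3; the exact filter used in the system is not specified, so both choices are assumptions.

```python
def _neighbourhood(img, r, c):
    """Values in the 3x3 window around (r, c), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [
        img[rr][cc]
        for rr in range(max(r - 1, 0), min(r + 2, h))
        for cc in range(max(c - 1, 0), min(c + 2, w))
    ]


def erode(img):
    """A pixel stays 1 only if every pixel in its 3x3 window is 1."""
    return [
        [1 if all(_neighbourhood(img, r, c)) else 0 for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def dilate(img):
    """A pixel becomes 1 if any pixel in its 3x3 window is 1."""
    return [
        [1 if any(_neighbourhood(img, r, c)) else 0 for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def clean(img):
    """Erosion removes isolated noise; dilation restores the object size."""
    return dilate(erode(img))
```

Applied to a binary map containing a 3x3 block and one isolated pixel, `clean` keeps the block and removes the isolated pixel.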
The filtered difference map is more precise, so we can now select the target object for tracking. The method we adopt is image projection, which is commonly used in image processing. We compute a horizontal projection and a vertical projection and, from these two profiles, mark the moving object region for the later tracking task. The result of the image projection method is shown in Figure 3-4.
To extract the moving object region, we first take the horizontal projection to find the largest region along the X coordinate, and then take the vertical projection to find the largest region along the Y coordinate. The result is a rectangular bounding box from the top-left coordinate to the bottom-right coordinate, as shown in Figure 3-4.
After finding the largest region of the binary motion map, we take it as the target object for the mean-shift tracking task if its area exceeds a minimum size and its width/height ratio is appropriate. Otherwise, if the region is too small or its width/height ratio is unreasonable, we restart the whole detection step to find another candidate moving object.
Figure 3-4 : Image projection
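The projection step and the size check can be sketched as below. The sketch interprets "largest region" as the run of non-zero projection values with the largest sum, and the thresholds `min_area`, `min_ratio` and `max_ratio` are hypothetical; the text gives neither the exact criterion nor the values.

```python
def largest_run(profile):
    """Return (start, end) of the non-zero run with the largest sum, or None."""
    best, best_sum, start = None, 0, None
    for i, v in enumerate(list(profile) + [0]):  # trailing 0 closes the last run
        if v and start is None:
            start = i
        elif not v and start is not None:
            total = sum(profile[start:i])
            if total > best_sum:
                best, best_sum = (start, i - 1), total
            start = None
    return best


def bounding_box(mask):
    """Return (left, top, right, bottom) of the largest moving region, or None."""
    cols = [sum(row[c] for row in mask) for c in range(len(mask[0]))]
    x_run = largest_run(cols)           # projection onto the X axis
    if x_run is None:
        return None
    rows = [sum(row[x_run[0]:x_run[1] + 1]) for row in mask]
    y_run = largest_run(rows)           # projection onto the Y axis
    return x_run[0], y_run[0], x_run[1], y_run[1]


def is_valid_target(box, min_area=4, min_ratio=0.2, max_ratio=5.0):
    """Accept the box only if it is large enough and not too elongated.

    All three thresholds are assumed values for illustration.
    """
    left, top, right, bottom = box
    w, h = right - left + 1, bottom - top + 1
    return w * h >= min_area and min_ratio <= w / float(h) <= max_ratio
```

If `bounding_box` returns `None` or `is_valid_target` returns `False`, detection would simply be repeated on the next frame pair.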
Before performing the tracking task, one more step is needed to reduce the effect of background noise. We adopt an elliptic region rather than a rectangular region because we found that the rectangular region contains more background information than the elliptic region. The difference between the two regions can be seen in Figure 3-5. In Chapter 5, the experimental results demonstrate the advantage of using the elliptic region.
Figure 3-5 : The rectangle region and the elliptic region
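One way to realize the elliptic region is a binary mask inscribed in the bounding box. The sketch below assumes an axis-aligned ellipse that touches the four sides of the box; the exact ellipse used in the system is not described.

```python
def elliptic_mask(width, height):
    """1 inside the ellipse inscribed in a width x height box, else 0.

    Assumption: the ellipse is axis-aligned and centered in the box.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0   # center in pixel coordinates
    ax, ay = width / 2.0, height / 2.0               # semi-axes
    return [
        [1 if ((x - cx) / ax) ** 2 + ((y - cy) / ay) ** 2 <= 1.0 else 0
         for x in range(width)]
        for y in range(height)
    ]
```

Only pixels where the mask is 1 would contribute to the color histogram, so the corners of the rectangle, which mostly show background, are left out.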
3.2 HSV Color Transformation
After the target object is found in the moving object detection part, we track it. Before tracking, however, we need to transform the color space of the target object from RGB (Figure 3-6) to HSV (Figure 3-7). The main purpose of this color space transformation is to deal with the brightness problem in our system. Because the diaphragm (iris) of the active camera adjusts itself automatically to the ambient illumination, we have to avoid the effect of this varying factor.
We solve this problem by using the HSV color space, because it separates the brightness information from the RGB color values, which reduces the sensitivity to this single illumination-dependent quantity.
The HSV model, also known as the HSB model, was created in 1978 by Alvy Ray Smith. It is a nonlinear transformation of the RGB color space and defines a color in terms of three components: hue, saturation, and value. The components are defined as follows [27]:
1. Hue: the color type, ranging from 0 to 360 degrees. Each value corresponds to one color; for example, 0 is red, 60 is yellow and 120 is green. The scale wraps around, so 360 degrees is equal to 0 degrees.
2. Saturation: the intensity (purity) of the color, ranging from 0% to 100%. 0 means no color, that is, only a gray value between black and white. 100 means the most intense, fully saturated color.
3. Value: the brightness of the color, also ranging from 0% to 100%. 0 is always black. Depending on the saturation, 100 may be white or a more or less saturated color.
Figure 3-6 : RGB color model [28]
Figure 3-7 : HSV color model [29]
In our system, we transform the RGB color space of the target object into HSV color space to avoid the brightness problem caused by the varying diaphragm of the camera, because the HSV color model separates the brightness information from the color information. The transformation from the RGB color model to the HSV color model proceeds step by step as follows:
1. Rescale the R, G, and B values from 0~255 to 0.0~1.0.
2. Let MAX and MIN be the maximum and minimum of (R, G, B), and then apply the following conversion from the RGB model to the HSV model.
3. Hue:
\[
H = \begin{cases}
0 & \text{if } MAX = MIN \\
\left(60 \times \dfrac{G - B}{MAX - MIN}\right) \bmod 360 & \text{if } MAX = R \\
60 \times \dfrac{B - R}{MAX - MIN} + 120 & \text{if } MAX = G \\
60 \times \dfrac{R - G}{MAX - MIN} + 240 & \text{if } MAX = B
\end{cases}
\]
4. Saturation:
\[
S = \begin{cases}
0 & \text{if } MAX = 0 \\
\dfrac{MAX - MIN}{MAX} & \text{otherwise}
\end{cases}
\]
5. Value:
\[
V = MAX
\]
6. In step 3, when MAX = MIN the value of H is normally considered "undefined." To make our calculation more convenient, we define H to be 0 in this case.
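The conversion steps can be sketched as a short function. This is a minimal illustration of the standard RGB-to-HSV conversion with the convention H = 0 for gray pixels described in step 6; the function name and the output ranges (degrees for H, 0.0~1.0 for S and V) are choices made for this sketch.

```python
def rgb_to_hsv(r, g, b):
    """Convert r, g, b in 0..255 to (h, s, v).

    h is in degrees [0, 360); s and v are in [0.0, 1.0].
    """
    r, g, b = r / 255.0, g / 255.0, b / 255.0      # step 1: rescale to 0.0~1.0
    mx, mn = max(r, g, b), min(r, g, b)            # step 2: MAX and MIN
    delta = mx - mn
    if delta == 0:
        h = 0.0                                    # step 6: H defined as 0
    elif mx == r:
        h = (60.0 * (g - b) / delta) % 360.0
    elif mx == g:
        h = 60.0 * (b - r) / delta + 120.0
    else:
        h = 60.0 * (r - g) / delta + 240.0
    s = 0.0 if mx == 0 else delta / mx
    return h, s, mx
```

Python's standard `colorsys.rgb_to_hsv` performs the same conversion on inputs in 0.0~1.0 and returns the hue scaled to 0.0~1.0 instead of degrees, which makes it a convenient cross-check.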
After executing these 6 steps to transform the RGB color of the target object into HSV color, the next step is to calculate the color histogram of the target object pixel by pixel. The most important question in this step is how many quantization levels (bins) each component of the HSV color model should have. This question is addressed in Chapter 5.
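The histogram step can be sketched as follows. The bin counts (16, 4, 4) for H, S and V are purely hypothetical placeholders, since the actual quantization is chosen in Chapter 5, and the function names are made up for this illustration.

```python
def bin_index(h, s, v, bins=(16, 4, 4)):
    """Map one HSV pixel to a single histogram bin.

    h in degrees [0, 360), s and v in [0.0, 1.0].
    bins: (H, S, V) quantization levels; the default is an assumed example.
    """
    hb, sb, vb = bins
    hi = min(int(h / 360.0 * hb), hb - 1)
    si = min(int(s * sb), sb - 1)
    vi = min(int(v * vb), vb - 1)
    return (hi * sb + si) * vb + vi


def hsv_histogram(pixels, bins=(16, 4, 4)):
    """Count the pixels of a region per joint (H, S, V) bin."""
    hist = [0] * (bins[0] * bins[1] * bins[2])
    for h, s, v in pixels:
        hist[bin_index(h, s, v, bins)] += 1
    return hist
```

Using fewer bins for V than for H is one way to make the histogram less sensitive to brightness, at the cost of color resolution.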
3.3 Mean-Shift Tracking
The mean shift method has previously been employed in the joint spatial-range domain of gray level and color images for discontinuity-preserving filtering and image segmentation, owing to its low cost and computational simplicity. It also offers an approach to real-time tracking of non-rigid objects whose statistical distributions characterize the object of interest [16]. This approach was proposed in 2000, and it has been proved to converge.
3.3.1 Mean-Shift Tracking Theorem
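By definition, given a set $\{X_i\}_{i=1 \ldots n}$ of $n$ points in the $d$-dimensional space $R^d$, the multivariate kernel density estimate with kernel $K(x)$ and window radius $h$, computed at the point $X_0$, is given by
\[
\hat{f}(X_0) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{X_0 - X_i}{h}\right)
\]
where a commonly used kernel is the multivariate normal:
\[
K(x) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\,\|x\|^2\right)
\]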
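The kernel-weighted color distribution of the candidate object at position $y$ can be described as:
\[
p_u(y) = C_h \sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - X_i}{h}\right\|^2\right) \delta\big[b(X_i) - u\big]
\]
where $b(X_i)$ is the histogram bin of the color at pixel $X_i$, $\delta$ is the Kronecker delta, $k$ is the kernel profile, $n_h$ is the number of pixels in the candidate window, and $C_h$ is a normalization constant.
Hence, tracking can be seen as searching for the position that minimizes the distance between the target model $q$ and the candidate $p(y)$, based on the estimate of the Bhattacharyya coefficient given by:
\[
\rho(y) \equiv \rho\big[p(y), q\big] = \sum_{u=1}^{m} \sqrt{p_u(y)\, q_u}
\]
After some derivation [16], we see that:
\[
\rho\big[p(y), q\big] \approx \frac{1}{2} \sum_{u=1}^{m} \sqrt{p_u(y_0)\, q_u} + \frac{C_h}{2} \sum_{i=1}^{n_h} w_i\, k\!\left(\left\|\frac{y - X_i}{h}\right\|^2\right),
\qquad w_i = \sqrt{\frac{q_{b(X_i)}}{p_{b(X_i)}(y_0)}}
\]
where $y_0$ is the current position estimate. Maximizing the second term therefore minimizes the distance, and this is what the mean-shift algorithm does.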
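The similarity measure can be illustrated with a short sketch. It assumes that both histograms are normalized to sum to 1 before comparison and uses the distance sqrt(1 - rho) from [16]; the 0.8 threshold corresponds to the 80% acceptance rule in Section 2.1.1, and the function names are hypothetical.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two histograms (normalized internally)."""
    sp, sq = float(sum(p)), float(sum(q))
    if sp == 0 or sq == 0:
        return 0.0
    return sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))


def distance(p, q):
    """Distance between two distributions: sqrt(1 - rho)."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya(p, q)))


def is_match(p, q, threshold=0.8):
    """Accept the tracking result when the similarity exceeds the threshold."""
    return bhattacharyya(p, q) > threshold
```

Identical distributions give a coefficient of 1 and a distance of 0, while distributions with no common bins give a coefficient of 0 and a distance of 1.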
3.3.2 Mean-Shift Tracking Algorithm
In our system, we use the mean-shift algorithm to track the object of interest. After the color model transformation, we use the histogram in HSV color space to execute the mean-shift tracking. In the context of tracking, given the color distributions q and p generated from the model and candidate object regions, a sample corresponds to the color observed at a pixel X and has an associated sample weight w(X), which is defined by:
\[
w(X) = \sqrt{\frac{q_u}{p_u(y)}}, \qquad u = b(X)
\]
where $b(X)$ is the histogram bin of the color observed at $X$.
The key point of the mean-shift algorithm is the computation of the shift from the initial object position $X_0$ to a new position $X_0 + \Delta X$, with
\[
\Delta X = \frac{\sum_{i} K\!\left(\frac{X_i - X_0}{h}\right) w(X_i)\,(X_i - X_0)}{\sum_{i} K\!\left(\frac{X_i - X_0}{h}\right) w(X_i)}
\]
where K(.) is usually a radially symmetric kernel function, such as a Gaussian kernel, and h is the window radius. We choose another popular kernel function, the flat kernel: K(x) = 1 for ||x|| < 1, and 0 otherwise.
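One mean-shift step with the flat kernel can be sketched as below. The sketch assumes that each pixel is given as its coordinates plus its histogram bin index, that q and p are histograms indexed by that bin, and that a pixel whose candidate bin is empty gets weight 0; these representation choices and the function names are assumptions.

```python
import math


def sample_weight(u, q, p):
    """w = sqrt(q_u / p_u); taken as 0 when the candidate bin p_u is empty."""
    return math.sqrt(q[u] / p[u]) if p[u] > 0 else 0.0


def mean_shift_vector(center, pixels, q, p, h):
    """Return the shift (dx, dy) of the window center for one iteration.

    center: (x, y) of the current window center X0
    pixels: iterable of ((x, y), bin_index) samples
    q, p:   target model and candidate histograms
    h:      window radius of the flat kernel
    """
    cx, cy = center
    num_x = num_y = den = 0.0
    for (x, y), u in pixels:
        dx, dy = x - cx, y - cy
        if dx * dx + dy * dy >= h * h:   # flat kernel: K = 1 only if ||d / h|| < 1
            continue
        w = sample_weight(u, q, p)
        num_x += w * dx
        num_y += w * dy
        den += w
    if den == 0:
        return 0.0, 0.0                  # no usable samples: stay in place
    return num_x / den, num_y / den
```

With the flat kernel, the shift is simply the weighted centroid of the pixels inside the window, measured relative to the current center. Pixels whose color is more frequent in the model than in the candidate pull the window toward themselves.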
We repeat this computation until it converges; convergence is proven in [16] and is typically reached after a few iterations. After convergence, we still need to check whether the tracking result matches our tracking target. We estimate the discrete density q = {q_u}, u = 1…m, from the m-bin histogram of the target model, while p(y) = {p_u(y)}, u = 1…m, is estimated at a given location y from the m-bin histogram of the target candidate.
The whole mean-shift algorithm can be summarized as follows:
I. Convert the target model in HSV color space into a color distribution histogram.
II. For each incoming frame, compute the HSV color histogram of the target candidate.
III. Calculate the mean-shift vector ΔX using formula (3.12).
IV. After convergence, calculate the Bhattacharyya coefficient using equation (3.7) to check whether the target candidate matches the target model.
V. If the target candidate matches, perform the adaptive size selecting method described next. Otherwise, we consider the target lost and search for another moving object.
3.4 Adaptive Size Selecting
After the mean-shift tracking step, the next step of our tracking system is adaptive size selecting, which is our own contribution. The mean-shift tracking step alone can only track the target object with a fixed-size window.
Because we want to track the target object as precisely as possible, the tracking region must follow the varying size of the object. We solve this problem by combining mean-shift tracking with the adaptive size selecting method.
Figure 3-8 : Three different size candidates
The adaptive size selecting method is also based on the mean-shift tracking algorithm. Normally, the mean-shift tracking algorithm uses a fixed-size target candidate to calculate the mean-shift vector ΔX. Here we use target candidates of different sizes and calculate the Bhattacharyya coefficient for each size.
As shown in Figure 3-8, after we calculate the new position of the tracked object, we choose several candidates of different sizes.
There are three candidates of different sizes in Figure 3-8: the smaller candidate, the normal candidate, and the larger candidate. We compare their Bhattacharyya coefficients to find the one with the largest similarity. Before that, the color distribution of each candidate size must be normalized, because the Bhattacharyya coefficient used to judge the similarity of each candidate requires the same total sample weight. Each bin of the color distribution of the target candidate, p(y) = {p_u(y)}, u = 1…m, is therefore multiplied by a normalization coefficient f.
f = (number of target model pixels) / (number of target candidate pixels)
Therefore, the equation for the Bhattacharyya coefficient is modified as:
\[
\rho(y) = \sum_{u=1}^{m} \sqrt{f\, p_u(y)\, q_u}
\]
We compare the Bhattacharyya coefficients calculated by equation (3.10) for the three sizes of target candidate, and then select the one with the largest Bhattacharyya coefficient as our final tracking result.
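The size selection can be sketched as follows. The sketch assumes that the histograms hold raw pixel counts, so that scaling the candidate by f gives it the same total as the target model, and it additionally divides the result by the number of model pixels so that the coefficient lies between 0 and 1. That final division and the candidate labels are assumptions made for this illustration.

```python
import math


def scaled_bhattacharyya(q_counts, p_counts):
    """Bhattacharyya coefficient with the candidate bins scaled by f.

    q_counts: target model histogram (pixel counts)
    p_counts: candidate histogram (pixel counts), possibly of another size
    """
    n_q, n_p = sum(q_counts), sum(p_counts)
    if n_q == 0 or n_p == 0:
        return 0.0
    f = n_q / float(n_p)                 # normalization coefficient f
    rho = sum(math.sqrt(f * p * q) for p, q in zip(p_counts, q_counts))
    return rho / n_q                     # assumption: rescale to the range [0, 1]


def select_size(q_counts, candidates):
    """Return the label of the candidate most similar to the target model.

    candidates: dict mapping a label (e.g. "smaller", "normal", "larger")
                to the histogram of the window of that size.
    """
    return max(candidates,
               key=lambda label: scaled_bhattacharyya(q_counts, candidates[label]))
```

For a model histogram [8, 2], the candidates [4, 0], [8, 2] and [10, 10] score about 0.89, 1.0 and 0.95 respectively, so the window that matches the model's color proportions wins regardless of its pixel count.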
Figure 3-9 and Figure 3-10 demonstrate the result of size selecting on a single picture. Figure 3-9 was retrieved from the Internet, and Figure 3-10 was enlarged from Figure 3-9 using Photoshop. In Figure 3-9, the tracking region is chosen by hand, and the one on the right is the target object. We use both mean-shift tracking and the size selecting method to calculate the correct position and size. Figure 3-10 shows that after 4 iterations the tracking result is good and the selected size also matches.
Figure 3-9 : Tracking region is chosen by hand
Figure 3-10 : The tracking result after 4 times iteration.
This selecting method can also run into problems. In some video sequences the size selection keeps diverging, meaning that the size selecting step always chooses the larger region or always chooses the smaller region. Both situations cause the size selection to diverge and the tracking to fail. We cannot reproduce this situation with a single picture because it only develops over many frames. However, we can handle it with the following two methods:
1. We can change the camera zoom so that the camera resets its focus and refines the image resolution of the target object. This method may be useful when the distance between the target object and the camera varies very often, but it may still fail.
2. We can restrict the size selecting times in our coding program such as only 3 times larger than the target object or only 3 times smaller than