Mean-Shift Object Tracking Based on a Multi-Blob Model


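National Chiao Tung University, Department of Electronics Engineering and Institute of Electronics. Master's Thesis. Mean-Shift Object Tracking Based on a Multi-Blob Model. Student: Wen-Han Yao. Advisor: Dr. Sheng-Jyh Wang. June 2006 (Republic of China year 95).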

Mean-Shift Object Tracking Based on a Multi-Blob Model. Student: Wen-Han Yao. Advisor: Dr. Sheng-Jyh Wang. A Thesis Submitted to the Department of Electronics Engineering and Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electronics Engineering. June 2006, Hsinchu, Taiwan, Republic of China.

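Mean-Shift Object Tracking Based on a Multi-Blob Model. Student: Wen-Han Yao. Advisor: Dr. Sheng-Jyh Wang. Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University.

Abstract (Chinese version)

In this thesis, we propose an algorithm that automatically detects a moving object in the scene and tracks it continuously, and we attempt to address problems commonly encountered in object tracking, such as occlusion, scene change, and illumination change. We define a multi-blob model to describe the moving object, define a suitable similarity measure based on this model, and develop a tracking system built on the mean-shift method. We also propose a way to adjust the size and orientation of the moving object in the image. The system further includes decision mechanisms for model updating and target loss, which make the tracking results more reasonable and robust. Experimental results show that the proposed algorithm correctly tracks the motion of moving objects in different scenes, both indoor and outdoor.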

Mean-Shift Object Tracking Based on a Multi-Blob Model. Student: Wen-Han Yao. Advisor: Dr. Sheng-Jyh Wang. Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University.

Abstract

In this thesis, we propose an object tracking system that automatically detects a single moving object in an image sequence and keeps track of it. The proposed system deals with the problems of occlusion, scene change, and luminance change. A multi-blob model is defined to represent the moving object. Based on this multi-blob model, we propose a new similarity measure and develop a new object tracking algorithm based on the mean-shift method. We also propose a strategy to update the size and orientation of the bounding ellipse of the moving object. For the sake of robustness, the proposed system contains decision criteria to handle model updating and loss of target. Simulation results demonstrate that the proposed object tracking algorithm can faithfully track the moving object in different scenes.

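Acknowledgments

I would especially like to thank my advisor, Professor Sheng-Jyh Wang, who not only guided me carefully in academic matters but also set an example for us in how to treat people and handle affairs. I thank all the members of the laboratory; it is because of your help in daily life and our discussions of coursework that this thesis could be completed smoothly. Finally, I thank my dear family, whose love encouraged me whenever I hit a bottleneck, so that I could pull myself together and keep moving forward.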

Content

Abstract (Chinese version)
Abstract
Acknowledgments
Content
List of Figures
Chapter 1. Introduction
Chapter 2. Backgrounds of Object Tracking
    2.1 Motion Detection
        2.1.1 Background Subtraction
        2.1.2 Temporal Differencing
        2.1.3 Optical Flow
    2.2 Motion Tracking
        2.2.1 Region-Based Tracking
        2.2.2 Active Contour-Based Tracking
        2.2.3 Model-Based Tracking
        2.2.4 Feature-Based Tracking
Chapter 3. Mean-Shift Tracking and Multi-Blob Model
    3.1 Traditional Mean-Shift Tracking
        3.1.1 Building Models and Measuring Similarity
        3.1.2 Object Tracking Procedure
        3.1.3 Discussion
    3.2 Multi-Blob Model Based Mean-Shift Tracking
        3.2.1 Moving Object Detection
        3.2.2 Multi-Blob Model
        3.2.3 Similarity Measure
        3.2.4 Mean-Shift Tracking Procedure
        3.2.5 Updating Size and Orientation
        3.2.6 Updating Reliability
        3.2.7 Updating Target Model
        3.2.8 Overall Object Tracking Process
Chapter 4. Experimental Results
Chapter 5. Conclusions
Reference

List of Figures

Figure 2.1 Effect of the second stage of detection on suppressing false detections. (a) Original image. (b) First stage detection result. (c) Suppressing pixels with high displacement probabilities. (d) Result using component displacement probability constraint. [4]
Figure 2.2 A human body is considered as a combination of blobs. (a) Original image. (b) A two dimensional representation of the blob statistics. [6]
Figure 2.3 Frames from a sequence of a person who drops an object. [7]
Figure 2.4 Skeletonization of moving targets. The structure and rigidity of the skeleton is significant in analyzing target motion. [8]
Figure 2.5 (a) The tree hierarchy. (b) Tracking skeletons in 3-D: front and recovered side-views. [10]
Figure 2.6 Example of a model graph which contains boundary cells and internal cells. [11]
Figure 2.7 (a) The similarity surface, the initial and final locations of mean-shift iterations. (b) Tracking results. [12]
Figure 2.8 The spatiogram captures spatial relationships among various colors, whereas the histogram discards all spatial information. (a) Original image. (b) Statistical distribution generated from the histogram of (a). (c) Statistical distribution generated from the spatiogram of (a). [14]
Figure 2.9 Using the scale-space mode-tracking method. The person is tracked well, both spatially and in scale. [15]
Figure 2.10 Updating both size and orientation through motion tracking. [16]
Figure 2.11 A flow chart in [17], which describes how to obtain the feature score.
Figure 2.12 Example of feature adaptation to avoid distractions. Left column: video frame with the object/background windows overlaid. Right column: weight image from top-ranked tracking features. [17]
Figure 3.1 Flow chart of building models and measuring similarity.
Figure 3.2 (a) The moving object to build the target model (Frame 15, sequence Hans). (b) The red point is the starting point $y_0$ of the mean-shift process, and the red bounding box represents the range of consideration corresponding to $y_0$. The blue point $y$ is the ideal tracking result (Frame 250, sequence Hans).
Figure 3.3 Mean-shift tracking flow
Figure 3.4 (a) The moving object to build target model (Frame 15, sequence Hans). (b) The mean-shift process starts from the red point $y_0$. The blue point $y$ is the ideal tracking result. However, the mean-shift tracking result converges to the green point $y_1$ instead (Frame 250, sequence Hans).
Figure 3.5 (a) The target model. (b) The color histogram and the similarity value with respect to tracking result $y_1$. (c) The color histogram and the similarity value with respect to the starting point $y_0$. (d) The color histogram and the similarity value with respect to the expected tracking result $y$.
Figure 3.6 The similarity surface, the initial and final locations of the mean-shift process. $y$ is respect to the expected tracking result.
Figure 3.7 (a) Frame 75 in the Watson sequence. (b) Frame 85 in the Watson sequence.
Figure 3.8 Flow of calculating the statistics. The statistics is used to represent the covariance ellipse.
Figure 3.9 Relation between $[\sigma_x \ \sigma_y \ \rho]$ and $[a \ b \ \theta]$.
Figure 3.10 Using a bounding ellipse to define the moving object.
Figure 3.11 Two different patterns which a spatiogram can not differentiate.
Figure 3.12 Five blobs that appear most frequently are marked with their mean values in the RGB color space.
Figure 3.13 (a) The moving object detected in Frame 15 of the Hans sequence. Build a multi-blob model according to the red bounding ellipse. (b) The region with the maximum similarity with the model in Frame 250 of the Hans sequence. (c) The similarity surface where $y$ is the center of bounding ellipse shown in (b).
Figure 3.14 The foreground region and the background region based on a bounding ellipse.
Figure 3.15 Result of size and orientation updating. The blue ellipses represent the 2-sigma contour and the 3-sigma contour with respect to the original covariance matrix. The red ellipse represents the 2-sigma contour with respect to the modified covariance matrix.
Figure 3.16 (a) The outer ellipse represents the 3-sigma contour, and the inner ellipse represents the 2-sigma contour with respect to the modified covariance matrix. (b) Color histogram of the background field and foreground field.
Figure 3.17 Reliability function, both x-axis and y-axis are in the log scale.
Figure 3.18 The reliability map, which can roughly identify the moving object.
Figure 3.19 Frames with pixels marked depend on reliability. (a) Frame 50, Sequence Watson. (b) Frame 65, Sequence Watson.
Figure 3.20 Flow chart of the proposed object tracking process.
Figure 4.1 Experimental results of the sequence "Hans". The orientation of the bounding ellipse and the reliability of blobs can be updated during tracking.
Figure 4.2 The reliability of Blobs 147 and 148. They are updated according to the background information.
Figure 4.3 The tracking result of sequence Watson. The target model updates at frame 65.
Figure 4.4 The variance ratio of tracking result.
Figure 4.5 The localization of the traditional mean-shift process is poor when the object's size decreases.
Figure 4.6 Number of iterations when executing the proposed mean-shift process (red) and the traditional mean-shift procedure (blue).
Figure 4.7 The experimental result, where the red ellipse indicates the target. Due to the target loss, the algorithm detected the object at Frame 75, Frame 185 and Frame 220. We switch the color from red to green at these frames.

Chapter 1. Introduction

The goal of object tracking is to find a specific target in successive frames of an image sequence. Various algorithms for object tracking have been proposed in recent years. Among them, the mean-shift method has become popular because of its robustness and simplicity. The iterative mean-shift process is a simple and robust technique for finding the position of a local maximum without knowing the overall distribution. Comaniciu and Meer [1] successfully applied this mean-shift method to object tracking problems.

So far, in object tracking problems, mean-shift is used to find the position that has the maximal similarity with the target. However, this kind of approach has several serious defects. First, the spatial information of the target is lost, which causes poor localization of the tracking result. Second, the commonly used similarity measures, such as the Bhattacharyya coefficient or the Kullback-Leibler distance, are not very discriminative. Third, the size and orientation of the tracked object cannot be updated in an efficient way.

In this thesis, we propose a new target model to represent the moving object and define a new similarity measure based on this target model. We call this concept a multi-blob model. We derive a mean-shift process for the multi-blob model and demonstrate the improved tracking results. To make the system more complete, we develop a simple motion detection process to roughly detect the location and size of the moving region. Moreover, we present a method to update both the size and the orientation of the bounding ellipse of the tracked object.

The thesis is organized as follows. In Chapter 2, we introduce the background of motion detection and motion tracking. In Chapter 3, we present the mean-shift process and discuss its drawbacks. Furthermore, we present our new target model, the multi-blob model, and our object tracking procedure. Some experimental results are given in Chapter 4. In Chapter 5, we draw our conclusions.

Chapter 2. Backgrounds of Object Tracking

Many previously proposed approaches to motion detection and motion tracking are conceptually similar, and it is fairly difficult to draw a clear line between the two problems. Nevertheless, there are some intrinsic differences between them. In this chapter, motion detection and motion tracking are treated as two separate topics and are discussed in the next two sections.

2.1 Motion Detection

Nearly every visual surveillance system starts with motion detection, which aims to segment the regions of an image that correspond to moving objects. A good motion detection result usually makes the subsequent motion tracking much easier. Existing motion detection techniques can be roughly divided into three major categories: background subtraction, temporal differencing, and optical flow.

2.1.1 Background Subtraction

Background subtraction is a popular method for motion detection. It detects moving regions in an image by taking the pixel-by-pixel difference between the current image and a reference background model. This method is extremely sensitive to dynamic changes in the monitored scene, such as lighting changes or shadows. A reliable background model is therefore needed to reduce the influence of these changes; that is, actively constructing and updating the background model is indispensable to visual surveillance. For fixed cameras, the key problem is to automatically recover and update the background model from a sequence of dynamic images. Unfavorable factors, such as illumination variation, shadows, and shaking branches, make it difficult to acquire and update background images. Some simple implementations use the time average of the image data, adaptive Gaussian estimation, or Kalman filtering to derive the background model.
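As a rough illustration of the time-average idea mentioned above, the following sketch keeps an exponential running average as the background model and thresholds the pixel-wise difference. The function names, the update rate `alpha`, and the threshold are assumptions made for this example; they are not taken from any of the cited systems.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background (exponential running average)."""
    return (1.0 - alpha) * background + alpha * frame

def detect_motion(background, frame, threshold=25.0):
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    return np.abs(frame.astype(np.float64) - background) > threshold

# Toy data (hypothetical): a flat gray background in which a bright 2x2 object appears.
background = np.full((6, 6), 100.0)
frame = background.copy()
frame[2:4, 2:4] = 200.0

mask = detect_motion(background, frame)            # True only on the 2x2 object
background = update_background(background, frame)  # the object slowly leaks into the model
```

The last line also shows the weakness of such a simple model: a foreground object is gradually absorbed into the background unless the update is restricted to pixels classified as background.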
While these methods can run in real time, they are generally not robust enough. In [1], the authors consider each pixel as an independent statistical process and record the observed intensity at each pixel over the previous n frames. The statistical distribution of the observed samples is then optimally fit to a mixture of Gaussian functions. This approach assumes that the temporal behavior of the intensity or color value at an image pixel is likely to follow a normal distribution. Moreover, it assumes that more than one state may occur at an image pixel when the pixel is observed over time.

A PTZ camera is a camera that can pan, tilt, and zoom. The scene captured by an active PTZ camera has a non-stationary background: as the camera moves with respect to a rigid scene, the image content changes over time. In this case, motion compensation is needed to construct temporary background models. In [3], image mosaicing techniques are used to build a panoramic representation of the scene background. Alternatively, in [4], the scene background is represented by a finite set of images on a virtual polyhedron, from which images of the background at any pan-tilt-zoom setting can be constructed.

On the other hand, kernel density estimation is proposed in [5] to estimate the ensemble characteristics of the sample data and produce the background model. This model keeps a sample of intensity values for each pixel in the image and uses this sample to estimate the density function of the pixel intensity distribution. The model can therefore estimate the probability of a newly observed intensity. It is able to handle scenes that are not completely stationary but contain small motions, such as wavering tree branches.

Figure 2.1 Effect of the second stage of detection on suppressing false detections. (a) Original image. (b) First stage detection result. (c) Suppressing pixels with high displacement probabilities. (d) Result using component displacement probability constraint. [5]

2.1.2 Temporal Differencing

The temporal differencing technique uses the pixel-wise differences between consecutive frames of an image sequence. There are many variants of this method; the simplest one calculates the absolute difference and applies a threshold function to detect the changes [6]. One problem with temporal differencing is that the detection result tends to include undesirable background regions. These regions correspond to the positions where the target appeared in the previous frame. One way to solve this problem is to use knowledge of the target's motion to remove these background regions from the template. This kind of method is usually called "motion cropping". An IIR filter is usually used to update the template so that the current template represents the target accurately.

2.1.3 Optical Flow

Optical flow is one of the most common approaches to motion estimation. In this approach, an image sequence is treated as a function $f(x, y, t)$, where $x$ and $y$ are the spatial coordinates and $t$ is the temporal coordinate. It is assumed that the intensity value projected from a three-dimensional point onto the image plane stays unchanged over time, even if the three-dimensional point is moving. This assumption can be expressed as

$$\frac{d f(x, y, t)}{dt} = 0. \qquad \text{(Eq. 2-1)}$$

Because $(x, y)$ is also a function of $t$, we can apply the chain rule to the above equation. With $u = dx/dt$ and $v = dy/dt$, Eq. 2-1 can be rewritten as

$$\frac{\partial f(x, y, t)}{\partial x}\, u(x, y, t) + \frac{\partial f(x, y, t)}{\partial y}\, v(x, y, t) + \frac{\partial f(x, y, t)}{\partial t} = 0. \qquad \text{(Eq. 2-2)}$$

This equation is usually called the optical flow equation or the optical flow constraint. If we define $V = (u, v)$, which represents the flow vector at each point, we can further reformulate the equation in vector form:

$$\left\langle \nabla f(x, y, t),\, V(x, y, t) \right\rangle + \frac{\partial f(x, y, t)}{\partial t} = 0, \qquad \text{(Eq. 2-3)}$$

where $\langle \cdot, \cdot \rangle$ is the inner product operator and $\nabla = \left[ \frac{\partial}{\partial x} \ \ \frac{\partial}{\partial y} \right]^{T}$ denotes the gradient operator.
Because the inner product of $\nabla f(x, y, t)$ and the component of $V(x, y, t)$ that is perpendicular to $\nabla f(x, y, t)$ is zero, the optical flow approach cannot estimate the motion when the moving direction is perpendicular to the intensity gradient of the object. Since the gradient operator is sensitive to noise, some researchers also apply Gaussian filtering along the spatial and temporal axes.

Optical flow detects motion based only on intensity changes, so the detection may be unreliable. Two typical examples are unobservable motion and fake motion. Unobservable motion happens when there is no obvious intensity change within the moving object. On the other hand, when the external illumination changes, the intensity of a stationary object may change over time and fake motion may occur.
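The limitation described above can be made concrete with a small sketch. The optical flow constraint (Eq. 2-2) is a single equation in the two unknowns u and v, so only the flow component along the intensity gradient (often called the normal flow) can be recovered from it. The code below computes that component; the function name `normal_flow`, the finite-difference scheme, and the synthetic ramp image are assumptions chosen for illustration, not part of the thesis.

```python
import numpy as np

def normal_flow(prev, curr, eps=1e-6):
    """Flow component along the gradient that satisfies f_x*u + f_y*v + f_t = 0."""
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    # np.gradient returns the derivative along rows (y) first, then along columns (x).
    fy, fx = np.gradient(0.5 * (prev + curr))
    ft = curr - prev                                   # temporal derivative (frame difference)
    mag2 = fx ** 2 + fy ** 2
    # Where the gradient vanishes, the constraint carries no information: report zero flow.
    scale = np.where(mag2 > eps, -ft / np.maximum(mag2, eps), 0.0)
    return scale * fx, scale * fy

# Synthetic image f(x, y) = 2x: the intensity gradient points along the x axis.
x = np.arange(8, dtype=float)
ramp = np.tile(2.0 * x, (8, 1))

# Shift by one pixel along x: f(x - 1, y) = 2x - 2. The motion is recovered.
u, v = normal_flow(ramp, ramp - 2.0)

# Shift along y: the image does not change, so the motion is unobservable.
u_perp, v_perp = normal_flow(ramp, ramp)
```

In the first case the recovered flow is one pixel per frame along x; in the second case the estimate is zero even though the pattern may be sliding along y, which is exactly the case of motion perpendicular to the intensity gradient.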

2.2 Motion Tracking

After motion detection, surveillance systems generally track moving objects from one frame to another in an image sequence. Although much research has addressed the motion tracking problem, existing techniques are still not robust enough for stable tracking. In general, a practical surveillance system must run in real time; however, it is still hard to process high-resolution images under this time constraint. Many problems therefore remain open in the motion tracking field.

To track objects successfully in an image sequence, various types of information are used to match an object in one image with the same object in another image. Motion tracking techniques can be roughly classified into a few categories according to the information they use. In the following sections, we introduce a few major types of motion tracking techniques. It is worth mentioning, however, that a motion tracking process may use more than one kind of information, and various kinds of information can be integrated.

2.2.1 Region-Based Tracking

Region-based tracking algorithms track moving objects based on variations of the image regions corresponding to the moving objects. In these algorithms, the background model is updated dynamically, and motion regions are usually detected by subtracting the background from the current image. Because of the computational complexity of updating the background model, these algorithms use static cameras instead of active cameras.

In [7], the authors proposed an algorithm that uses small blob features to track a single human body in an indoor environment. A human body is considered as a combination of several blobs, each of which represents one body part, such as the head, the torso, or a limb. Moreover, both the human body and the background are modeled with Gaussian distributions that represent the intensity value of every pixel. Finally, the pixels belonging to the human body are assigned to the various blobs using a log-likelihood measure. By tracking each small blob, a moving human object can thus be tracked successfully.
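The blob assignment step described above can be sketched as follows, assuming each blob is summarized by a mean vector and a covariance matrix and each pixel is described by a small feature vector (for example position, possibly extended with color). The feature choice and all names here are hypothetical and are not taken from [7].

```python
import numpy as np

def gaussian_log_likelihood(features, mean, cov):
    """Log-density of each row of `features` under a Gaussian blob N(mean, cov)."""
    d = features - mean
    maha = np.einsum('ni,ij,nj->n', d, np.linalg.inv(cov), d)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    k = mean.shape[0]
    return -0.5 * (maha + logdet + k * np.log(2.0 * np.pi))

def assign_to_blobs(features, means, covs):
    """Assign every pixel to the blob with the highest log-likelihood."""
    scores = np.stack(
        [gaussian_log_likelihood(features, m, c) for m, c in zip(means, covs)],
        axis=1,
    )
    return np.argmax(scores, axis=1)

# Two hypothetical blobs in a 2-D feature space and three pixels to classify.
means = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
covs = [np.eye(2), np.eye(2)]
pixels = np.array([[1.0, 0.0], [9.0, 11.0], [4.0, 4.0]])
labels = assign_to_blobs(pixels, means, covs)
```

With identity covariances the rule reduces to nearest-mean assignment; unequal covariances let an elongated blob, such as a limb, claim pixels that are farther from its center along its long axis.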

Figure 2.2 A human body is considered as a combination of blobs. (a) Original image. (b) A two dimensional representation of the blob statistics. [7]

Although region-based algorithms usually use background subtraction to obtain moving regions, shadows may cause false detections. To avoid this, [8] proposed an adaptive background subtraction method in which color and gradient information are combined to cope with shadows and unreliable color cues in motion detection. Tracking is then performed at three levels of abstraction: regions, people, and groups. Regions can merge and split. A human is composed of one or more regions, which are grouped together under geometric constraints, and a human group consists of one or more people. Using the region tracker and the individual color appearance model, person-to-person and person-to-object interactions can therefore be handled.

Figure 2.3 A sequence of images, in which a person drops an object. [8]

2.2.2 Active Contour-Based Tracking

Active contour-based tracking algorithms track objects by representing their outlines as bounding contours and updating these contours dynamically in every frame. To find the bounding contour of the moving object, background subtraction is often applied. Nevertheless, no motion detection algorithm is perfect, and there will be spurious pixels and holes in the detected moving features. In [9], a morphological dilation followed by an erosion is used to solve this problem. The morphological operation changes the bounding contour of the moving object. This approach then skeletonizes the boundary to build a star representation of the moving object. By analyzing the torso angle and the periodic motion of the star, simple behavior recognition can be achieved [9].

Figure 2.4 Skeletonization of moving targets. The structure and rigidity of the skeleton is significant in analyzing target motion. [9]

To find the complete contour of moving objects, the active contour approach, commonly called "snake", is widely used. Snakes are deformable contours that move under the influence of image-intensity forces, subject to certain internal deformation constraints. In [10], the authors proposed a Kalman snake model in the spatio-velocity space to track non-rigid moving targets. In contrast to region-based tracking algorithms, these active contour-based algorithms describe objects in a simpler and more effective way and can thus reduce the computational complexity.

2.2.3 Model-Based Tracking

Model-based tracking methods build a model in advance and match the moving object to the model. A motion model is also established to work together with a search strategy. To track different objects, many types of models have been proposed. In [11], a hierarchical model is proposed to describe human dynamics. The authors regard the transition from one pose to the next as the dynamics of the action and encode this transition using a hidden Markov model (HMM). In this approach, the models of both poses and dynamics are trained from real data. The authors first describe the model of valid poses and then the HMMs for the dynamics. Because this hierarchical model tracks skeletal poses, the tracking method is largely independent of the image modality.

Figure 2.5 (a) The tree hierarchy. (b) Tracking skeletons in 3-D: front and recovered side-views. [11]

2.2.4 Feature-Based Tracking

In contrast to model-based tracking methods, feature-based tracking builds a model from the features of the moving object. Many features can help in tracking objects, such as edges, corners, color distribution, skin tone, and human eyes. In [12], an active template that characterizes regional and structural features, such as texture, shape, and color, is proposed to track moving objects. In this approach, the authors design an energy function and adapt the model dynamically by minimizing the energy in order to track non-rigid targets.

Figure 2.6 Example of a model graph which contains boundary cells and internal cells. [12]

In recent years, an efficient algorithm called "mean-shift" has been widely used for motion tracking. In [1], a target model is constructed by calculating the color histogram of the moving object. In other words, the target model uses the color distribution as its feature. The target model is defined as

$$q_u = C \sum_{i=1}^{n} k\!\left( \left\| x_i \right\|^2 \right) \delta\!\left[ b(x_i) - u \right], \qquad \text{(Eq. 2-4)}$$

where $b(x_i)$ is the histogram bin of the pixel at location $x_i$, $\delta$ is the Kronecker delta function, $C$ is the normalization constant that ensures $\sum_{u=1}^{m} q_u = 1$, and $k\!\left( \| x \|^2 \right)$ is the kernel profile. Similarly, a target candidate is defined as

$$p_u(y) = C_h \sum_{i=1}^{n_h} k\!\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \delta\!\left[ b(x_i) - u \right]. \qquad \text{(Eq. 2-5)}$$

The Bhattacharyya coefficient is used to measure the similarity between the target model and the target candidate. The coefficient is defined as

$$\rho(y) \equiv \rho\!\left[ p(y), q \right] = \sum_{u=1}^{m} \sqrt{p_u(y)\, q_u}. \qquad \text{(Eq. 2-6)}$$

Maximizing the similarity function leads to

$$y_1 = \frac{\sum_{i=1}^{n_h} x_i\, w_i\, g\!\left( \left\| \frac{y_0 - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n_h} w_i\, g\!\left( \left\| \frac{y_0 - x_i}{h} \right\|^2 \right)}, \qquad \text{(Eq. 2-7)}$$

where $w_i = \sum_{u=1}^{m} \sqrt{\frac{q_u}{p_u(y_0)}}\; \delta\!\left[ b(x_i) - u \right]$, and $g(x) = -k'(x)$ is the profile of the kernel used in the mean-shift update (the kernel with profile $k$ is its shadow kernel). In this procedure, the kernel is iteratively moved from the current location $y_0$ to the new location $y_1$ according to Eq. 2-7.
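To make the update rule concrete, here is a minimal sketch of histogram-based mean-shift tracking following Eq. 2-4 to Eq. 2-7. It assumes the Epanechnikov profile k(r) = 1 - r for r <= 1, for which g = -k' is constant inside the window, and it works on a pre-quantized map of bin indices. The toy frames, the bandwidth h = 4, and all function names are assumptions made for illustration rather than the configuration used in [1].

```python
import numpy as np

def kernel_histogram(bins, coords, center, h, m):
    """Kernel-weighted histogram (Eq. 2-4 / Eq. 2-5) with profile k(r) = 1 - r, r <= 1."""
    r = np.sum((coords - center) ** 2, axis=1) / h ** 2    # ||(y - x_i) / h||^2
    k = np.clip(1.0 - r, 0.0, None)
    hist = np.bincount(bins, weights=k, minlength=m)
    return hist / hist.sum()                                # plays the role of C (or C_h)

def mean_shift_step(bins, coords, q, y0, h):
    """One update y0 -> y1 following Eq. 2-7."""
    p = kernel_histogram(bins, coords, y0, h, len(q))
    ratio = np.sqrt(q / np.maximum(p, 1e-12))               # sqrt(q_u / p_u(y0)) for each bin
    w = ratio[bins]                                         # weight w_i of every pixel
    r = np.sum((coords - y0) ** 2, axis=1) / h ** 2
    g = (r <= 1.0).astype(float)                            # g = -k' is 1 inside the window
    wg = w * g
    return (coords * wg[:, None]).sum(axis=0) / wg.sum()

def make_frame(center, size=21):
    """Hypothetical toy frame: bin 1 inside a 5x5 square around `center`, bin 0 elsewhere."""
    frame = np.zeros((size, size), dtype=int)
    r, c = center
    frame[r - 2:r + 3, c - 2:c + 3] = 1
    return frame.ravel()

rows, cols = np.mgrid[0:21, 0:21]
coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)

# Target model built in the first frame, where the object is centered at (10, 10).
q = kernel_histogram(make_frame((10, 10)), coords, np.array([10.0, 10.0]), h=4.0, m=2)

# In the next frame the object has moved to (12, 11); start from the old location.
bins_next = make_frame((12, 11))
y = np.array([10.0, 10.0])
for _ in range(50):
    y_new = mean_shift_step(bins_next, coords, q, y, h=4.0)
    if np.linalg.norm(y_new - y) < 1e-3:
        break
    y = y_new
```

Each step moves the window toward pixels whose bins are under-represented in the candidate histogram relative to the target model, so the iterates drift from the old location toward the new object position. Once the candidate histogram equals the target histogram, all weights are equal and the update leaves the window where it is.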

Figure 2.7 (a) The similarity surface, the initial and final locations of mean-shift iterations. (b) Tracking results. [1]

Although the mean-shift approach is a relatively efficient and robust method for motion tracking, it still has some drawbacks. In recent years, many researchers have tried to improve the mean-shift algorithm with respect to the following issues.

(a) Accuracy improvement

In [13], the authors design experiments to determine whether the adopted similarity measure is appropriate. The simulations indicate that the Bhattacharyya and K-L distances are inaccurate in higher dimensions, and that the computations in higher dimensions are unstable in the sense that repeated computations using different samples may yield varying results. Hence, they redefine the target model as

$$\hat{q}_x(x, u) = \frac{1}{N} \sum_{i=1}^{N} w\!\left( \left\| \frac{x - x_i}{\sigma} \right\|^2 \right) k\!\left( \left\| \frac{u - u_i}{h} \right\|^2 \right), \qquad \text{(Eq. 2-8)}$$

where $x$ represents the spatial location and $u$ is the corresponding feature vector.

Similarly, the target candidate is redefined as

$$\hat{p}_y(y, v) = \frac{1}{M} \sum_{j=1}^{M} w\!\left( \left\| \frac{y - y_j}{\sigma} \right\|^2 \right) k\!\left( \left\| \frac{v - v_j}{h} \right\|^2 \right). \qquad \text{(Eq. 2-9)}$$

The similarity between the target model and the candidate in the joint feature-spatial space is defined to be

$$J\!\left( \hat{q}_x, \hat{p}_y \right) = \frac{1}{MN} \sum_{i=1}^{N} \sum_{j=1}^{M} w\!\left( \left\| \frac{x_i - y_j}{\sigma} \right\|^2 \right) k\!\left( \left\| \frac{u_i - v_j}{h} \right\|^2 \right). \qquad \text{(Eq. 2-10)}$$

Even though the experimental results show that the tracking accuracy is greatly improved, the computational complexity of directly evaluating this similarity measure can be very high, since every pair of model and candidate samples contributes a term.

(b) Usage of spatial information

All the aforementioned target models lack information about the spatial distribution of colors. To keep more useful features, [14] proposed a new representation, called the spatiogram, for the target model and the candidate. Spatiograms offer a richer representation that captures not only the values of the pixels but also their spatial relationships. In this approach, a spatiogram is defined as

$$h_p(b) = \left\langle n_b,\, \mu_b,\, \Sigma_b \right\rangle, \qquad b = 1, \ldots, M, \qquad \text{(Eq. 2-11)}$$

where $n_b$ is the number of pixels in the $b$th bin, and $\mu_b$ and $\Sigma_b$ are the mean vector and the covariance matrix of the coordinates of those pixels, respectively. A mean-shift process is also developed for motion tracking. Simulations in [13] show that this spatiogram approach offers more robust tracking than the traditional histogram-based approaches.

Figure 2.8 The spatiogram captures spatial relationships among various colors, whereas the histogram discards all spatial information. (a) Original image. (b) Statistical distribution generated from the histogram of (a). (c) Statistical distribution generated from the spatiogram of (a). [14]
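A small sketch may help to show what the spatiogram of Eq. 2-11 adds to a plain histogram. For every bin it stores the pixel count together with the mean and covariance of the pixel coordinates in that bin. The function name, the toy patterns, and the choice of the maximum-likelihood covariance (dividing by the count) are assumptions made for this example and are not prescribed by [14].

```python
import numpy as np

def spatiogram(bins, coords, m):
    """Per-bin count n_b, coordinate mean mu_b, and coordinate covariance Sigma_b."""
    counts = np.bincount(bins, minlength=m)
    means = np.zeros((m, coords.shape[1]))
    covs = np.zeros((m, coords.shape[1], coords.shape[1]))
    for b in range(m):
        pts = coords[bins == b]
        if len(pts) == 0:
            continue                       # empty bin: leave mean and covariance at zero
        means[b] = pts.mean(axis=0)
        d = pts - means[b]
        covs[b] = d.T @ d / len(pts)       # maximum-likelihood covariance estimate
    return counts, means, covs

# Four pixels in a row; the coordinates are (x, y) with y = 0.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
pattern_a = np.array([0, 0, 1, 1])         # bin 0 on the left, bin 1 on the right
pattern_b = np.array([1, 1, 0, 0])         # same colors, mirrored layout

counts_a, means_a, covs_a = spatiogram(pattern_a, coords, m=2)
counts_b, means_b, covs_b = spatiogram(pattern_b, coords, m=2)
```

The two patterns have identical counts, so an ordinary histogram cannot tell them apart, whereas their per-bin means differ (0.5 versus 2.5 along x for bin 0).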

(22) (c) Scale and orientation selection

Scale selection is also a problem. If the kernel size is too large, the tracking window will contain many background pixels in addition to the foreground object pixels. Because the histogram is then polluted with background data, a large kernel window may cause incorrect tracking. Conversely, a kernel size that is too small may lead to poor object localization. In [1], the “plus or minus 10 percent” scale adaptation method is used to estimate the optimal scale, but the computational cost becomes three times the original cost. In [13] and [15], the authors treat the scale as a variable in the tracking algorithm and update the scale by applying the mean-shift procedure along the scale axis.

Figure 2.9 Scale-space mode-tracking method. The person is tracked well, both spatially and in scale. [15]

On the other hand, the orientation of the target is also important. A poor orientation estimate causes the target information to be polluted by noisy background pixels, and a rectangular bounding box does not help in estimating the object's orientation. In [16], the authors use a bivariate Gaussian profile as the kernel in the mean-shift procedure. By calculating the covariance matrix of all pixels belonging to the moving object, the orientation and the scale can be obtained at the same time.

Figure 2.10 Updating both size and orientation in motion tracking. [16]

(d) Feature selection

Object tracking is cast as a local discrimination problem: distinguishing foreground objects from the background. During the tracking process, background clutter can easily distract the mean-shift window and cause tracking failure. In [17], the authors develop a strategy to select the features that best discriminate foreground pixels from background pixels. To quantify the discriminative power of a feature, a two-class variance ratio is used as the feature score. Based on the feature score, the most discriminative features can be

(23) chosen for more robust tracking.

Figure 2.11 A flow chart in [17], which describes how to obtain the feature score.

Figure 2.12 Example of feature adaptation to avoid distractions. Left column: video frame with the object/background windows overlaid. Right column: weight image from the top-ranked tracking features. [17]

(24) Chapter 3. Mean-Shift Tracking and Multi-Blob Model

Mean-shift is a powerful mathematical tool for the object tracking problem; however, it still has many drawbacks, as mentioned in Section 2.2.4. In this chapter, we propose a new mean-shift-based approach that tries to avoid these shortcomings.

3.1 Traditional Mean-Shift Tracking

Mean-shift is a technique that can find the location of a local maximum without knowing the overall distribution. In object tracking problems, mean-shift is usually used to find the local maximum of the similarity surface.

3.1.1 Building Models and Measuring Similarity

To define the similarity between two objects, we have to define the features of an object first. Recall Eq. 2-4 and Eq. 2-5, which are used to build models. In [1], the object is represented by a 16 × 16 × 16 histogram in the RGB space and the model becomes

$$q_u = C\sum_{i\in I} k\!\left(\|x_i\|^{2}\right)\delta\!\left[b(x_i)-u\right],\qquad u=1,\dots,16\times16\times16, \tag{Eq. 3-1}$$

where $I$ represents the set of pixels belonging to the object, and $k(\cdot)$ is the kernel function for the spatial information.

Based on the model designed to represent the features of the object, we may choose an appropriate measure to evaluate the similarity. In [13], the histogram is considered as a distribution in the feature space. Existing mean-shift trackers use the Kullback-Leibler distance and the Bhattacharyya distance to measure the similarity between distributions. The Kullback-Leibler distance between two distributions is defined as

$$D(y)=\sum_{u=1}^{16\times16\times16} p_u(y)\,\log\frac{p_u(y)}{q_u}. \tag{Eq. 3-2}$$

On the other hand, the Bhattacharyya distance is

(25)
$$B(y)=\sqrt{1-\sum_{u=1}^{16\times16\times16}\sqrt{p_u(y)\,q_u}}. \tag{Eq. 3-3}$$

The figure below shows the flow of calculating the similarity.

Figure 3.1 Flow chart of building models and measuring similarity.
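As a hedged illustration of the two measures in Eq. 3-2 and Eq. 3-3, the following sketch evaluates them on normalized histograms. The function names and the small constant `eps` are assumptions added for numerical safety, and the distance is taken as the square root of one minus the Bhattacharyya coefficient, as in [1].

```python
import numpy as np

def kl_distance(p, q, eps=1e-12):
    """Kullback-Leibler distance sum_u p_u log(p_u / q_u) (Eq. 3-2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                                  # terms with p_u = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def bhattacharyya_coefficient(p, q):
    """rho = sum_u sqrt(p_u q_u)."""
    return float(np.sum(np.sqrt(np.asarray(p, float) * np.asarray(q, float))))

def bhattacharyya_distance(p, q):
    """sqrt(1 - rho); clipped at 0 to absorb rounding error (Eq. 3-3)."""
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q))))

q_model = np.array([0.5, 0.5, 0.0, 0.0])      # toy target model
p_same = q_model.copy()                       # identical candidate
p_disjoint = np.array([0.0, 0.0, 0.5, 0.5])   # candidate with no common bins
```

Identical histograms give a zero distance under both measures, while histograms with no common bins give a Bhattacharyya distance of one, which is the largest possible value.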

(26) 3.1.2 Object Tracking Procedure

In [1], the Bhattacharyya distance is used to measure similarity. In Eq. 3-3, minimizing the distance is equivalent to maximizing the Bhattacharyya coefficient, which is expressed in Eq. 2-6. The search for the new target location in the current frame starts at the location $y_0$ of the target in the previous frame. As shown in Figure 3.2(b), the location $y$ has the maximum similarity.

(a) (b)
Figure 3.2 (a) The moving object used to build the target model (Frame 15, sequence Hans). (b) The red point is the starting point $y_0$ of the mean-shift process, and the red bounding box represents the range of consideration corresponding to $y_0$. The blue point $y$ is the ideal tracking result (Frame 250, sequence Hans).

Since the motion between two frames is small, $y$ is usually close to $y_0$, but its precise location is unknown. Thus, we may use the first-order Taylor expansion to represent the Bhattacharyya coefficient $\rho(y)$ in terms of $\rho(y_0)$.

(27)
$$\begin{aligned}
\rho(y) &= \sum_{u=1}^{m}\sqrt{p_u(y)\,q_u}\\
&\approx \sum_{u=1}^{m}\sqrt{p_u(y_0)\,q_u}+\frac{1}{2}\sum_{u=1}^{m}\sqrt{\frac{q_u}{p_u(y_0)}}\,\bigl(p_u(y)-p_u(y_0)\bigr)\\
&= \frac{1}{2}\sum_{u=1}^{m}\sqrt{p_u(y_0)\,q_u}+\frac{1}{2}\sum_{u=1}^{m}p_u(y)\sqrt{\frac{q_u}{p_u(y_0)}}\\
&= \frac{1}{2}\sum_{u=1}^{m}\sqrt{p_u(y_0)\,q_u}+\frac{C_h}{2}\sum_{i=1}^{n_h}\sum_{u=1}^{m}k\!\left(\left\|\frac{y-x_i}{h}\right\|^{2}\right)\sqrt{\frac{q_u}{p_u(y_0)}}\,\delta\!\left[b(x_i)-u\right]\\
&= \frac{1}{2}\sum_{u=1}^{m}\sqrt{p_u(y_0)\,q_u}+\frac{C_h}{2}\sum_{i=1}^{n_h}w_i\,k\!\left(\left\|\frac{y-x_i}{h}\right\|^{2}\right),
\end{aligned} \tag{Eq. 3-4}$$

where $w_i=\sum_{u=1}^{m}\sqrt{\dfrac{q_u}{p_u(y_0)}}\,\delta\!\left[b(x_i)-u\right]$. The first term in Eq. 3-4 is independent of $y$; hence, we have to maximize the second term with respect to the vector $y$. Note that the second term represents the density estimate computed with the kernel profile $k(\cdot)$ at $y$ in the current frame, with the data being weighted by $w_i$. Denote the second term as $J(y)$. The gradient of $J(y)$ with respect to $y$ is expressed as

$$\begin{aligned}
\nabla J(y) &= \frac{C_h}{2}\sum_{i=1}^{n_h}w_i\,k'\!\left(\left\|\frac{y-x_i}{h}\right\|^{2}\right)\cdot\frac{2}{h^{2}}\,(y-x_i)\\
&= \frac{C_h}{h^{2}}\,y\sum_{i=1}^{n_h}w_i\,k'\!\left(\left\|\frac{y-x_i}{h}\right\|^{2}\right)-\frac{C_h}{h^{2}}\sum_{i=1}^{n_h}w_i\,x_i\,k'\!\left(\left\|\frac{y-x_i}{h}\right\|^{2}\right).
\end{aligned} \tag{Eq. 3-5}$$

By setting Eq. 3-5 to 0, we have

$$y_1=\frac{\displaystyle\sum_{i=1}^{n_h}w_i\,x_i\,g\!\left(\left\|\frac{y_0-x_i}{h}\right\|^{2}\right)}{\displaystyle\sum_{i=1}^{n_h}w_i\,g\!\left(\left\|\frac{y_0-x_i}{h}\right\|^{2}\right)}, \tag{Eq. 3-6}$$

(28) where $g(\cdot)=-k'(\cdot)$, named the shadow kernel, is also the profile of a radial basis function, and $w_i\,g\!\left(\left\|\frac{y_0-x_i}{h}\right\|^{2}\right)$ can be considered as the convolution of $w_i$ and the shadow kernel.

Based on Eq. 3-6, an iterative procedure is obtained. To achieve better performance, the kernel function should be selected carefully. Kernels with the Gaussian profile or the Epanechnikov profile are recommended. For the Gaussian kernel, the derivative of the profile, $g(\cdot)$, is still a Gaussian function. For the Epanechnikov kernel, we have

$$k(x)=\begin{cases}\dfrac{1}{2}\,c_d^{-1}\,(d+2)\,(1-x), & \text{if } x\le 1,\\[4pt] 0, & \text{otherwise,}\end{cases} \tag{Eq. 3-7}$$

where $c_d$ is the volume of the unit $d$-dimensional sphere and $d$ equals 2 in this case. The derivative of the profile is constant and Eq. 3-6 reduces to

$$y_1=\frac{\sum_{i=1}^{n_h}w_i\,x_i}{\sum_{i=1}^{n_h}w_i}. \tag{Eq. 3-8}$$

By using the Gaussian kernel in Eq. 3-6, a mean-shift tracking process can be built. Figure 3.3 shows the mean-shift tracking flow, where the detection step is done by hand.
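The iteration of Eq. 3-6 with a Gaussian profile can be sketched as follows. This is a minimal illustration under stated assumptions, not the tracker of [1]: the function names, the stopping tolerance `tol`, and `max_iter` are hypothetical, and the demo keeps the weights $w_i$ fixed, whereas a real tracker would recompute $p_u(y_0)$ and hence $w_i$ at every iteration.

```python
import numpy as np

def pixel_weights(bin_idx, q, p0, eps=1e-12):
    """w_i = sqrt(q_u / p_u(y0)) for the bin u = b(x_i) of each pixel (Eq. 3-4)."""
    return np.sqrt(q[bin_idx] / np.maximum(p0[bin_idx], eps))

def mean_shift_step(y0, x, w, h):
    """One update of Eq. 3-6; for a Gaussian profile, g is again Gaussian."""
    s = np.sum(((y0 - x) / h) ** 2, axis=1)   # ||(y0 - x_i) / h||^2
    g = np.exp(-0.5 * s)                      # constant factors cancel in the ratio
    return (w * g) @ x / np.sum(w * g)

def mean_shift(y0, x, w, h, tol=1e-3, max_iter=50):
    """Iterate Eq. 3-6 until the shift is smaller than tol (assumed criterion)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        y_new = mean_shift_step(y, x, w, h)
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# Synthetic demo: pixel weights peak at a known centre on a 21x21 grid.
gx, gy = np.meshgrid(np.arange(21.0), np.arange(21.0))
x = np.column_stack([gx.ravel(), gy.ravel()])
centre = np.array([12.0, 8.0])
w = np.exp(-np.sum((x - centre) ** 2, axis=1) / (2 * 2.0 ** 2))
y_final = mean_shift([8.0, 10.0], x, w, h=5.0)
```

In this demo the iterate moves from the starting point toward the centre of the weight distribution, which is the mode of the weighted density estimate.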

(29) Figure 3.3 Mean-shift tracking flow.

Employing the mean-shift tracking algorithm and letting $y_0$ in Figure 3.2(b) be the starting point of the mean-shift process, we obtain the tracking result shown below.

(a) (b)
Figure 3.4 (a) The moving object used to build the target model (Frame 15, sequence Hans). (b) The mean-shift process starts from the red point $y_0$. The blue point $y$ is the ideal tracking result. However, the mean-shift tracking result converges to the green point $y_1$ instead (Frame 250, sequence Hans).

(30) It is observed that the green bounding box in Figure 3.4(b), which represents the tracking result, is not accurate. Theoretically, $y_1$ should have the maximal Bhattacharyya coefficient. The color histograms of the three bounding boxes in Figure 3.4(b) are shown below. Checking the target model and the computed Bhattacharyya coefficients confirms that $y_1$ indeed has the maximal similarity value and lies at the peak of the similarity surface.

(a) (b) (c) (d)
Figure 3.5 (a) The target model. (b) The color histogram and the similarity value with respect to the tracking result $y_1$. (c) The color histogram and the similarity value with respect to the starting point $y_0$. (d) The color histogram and the similarity value with respect to the expected tracking result $y$.

(31) Figure 3.6 The similarity surface, with the initial and final locations of the mean-shift process. $y$ corresponds to the expected tracking result.

3.1.3 Discussion

To sum up, the purpose of building an appropriate model is to reduce redundant information while keeping the features that are useful for tracking. Based on the chosen model, we then choose an appropriate similarity measure. We can find the maximal similarity value by differentiating the Bhattacharyya coefficient with respect to $y$. With the use of a radial basis function, this differentiation yields the iterative mean-shift formula.

According to the simulation results in Section 3.1.2, the color histogram does not seem to be an ideal representation of object appearance, since the spatial information is discarded. The lack of spatial information, together with the distraction caused by background pixels, leads to poor tracking accuracy. By building a target model that contains more information, in both the spatial domain and the feature domain, and by developing an appropriate similarity measure, we can improve the tracking performance.

(32) 3.2 Multi-Blob Model Based Mean-Shift Tracking

Building a model to represent the features of the target can reduce redundant data and computational complexity. In this thesis, we propose a new target model, called a multi-blob model, to represent the features of an object. Based on the multi-blob model, we design a similarity measure and a corresponding mean-shift tracking algorithm.

3.2.1 Moving Object Detection

To build a target model, we have to know the position of the moving object first. In our tracking system, a pan-tilt-zoom camera is used; thus, the background may change. Here, we employ a motion compensation technique to detect moving objects in the frame. Our goal is to detect the region of a moving object in a rough but efficient manner. We choose a small number of blocks, say three or four, and estimate their motion vectors. We then compensate the motion of all blocks based on these motion vectors. The residual between the compensated frame and the reference frame indicates the area of the moving object. Figure 3.7 shows two frames in the Watson sequence, with an image size of 480 × 640. There is only one moving object in these frames, and the camera pans during the capture of these two images.

(a) (b)
Figure 3.7 (a) Frame 75 in the Watson sequence. (b) Frame 85 in the Watson sequence.

To define the location of the moving object by using a covariance ellipse, we have to gather the statistics of the foreground region. Figure 3.8 shows the flow of calculating the mean and the covariance matrix of the foreground region.

(33) Figure 3.8 Flow of calculating the statistics. The statistics are used to represent the covariance ellipse.

From the flow shown in Figure 3.8, we have $\mu=[\,408.97\ \ 273.04\,]$ and

$$V=\begin{bmatrix}1217.1 & -1019.7\\ -1019.7 & 6246.5\end{bmatrix}.$$

The bounding ellipse can be represented by both $[\,\sigma_x\ \ \sigma_y\ \ \rho\,]$ and $[\,a\ \ b\ \ \theta\,]$, where $a$ represents the length of the semi-major axis, $b$ represents the length of the semi-minor axis, $\theta$ is the offset angle measured clockwise from the y-axis, and the

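(34) focal points are $f_1(c\sin\theta,\ c\cos\theta)$ and $f_2(-c\sin\theta,\ -c\cos\theta)$. Figure 3.9 shows the relation between these two tuples.

Figure 3.9 Relation between $[\,\sigma_x\ \ \sigma_y\ \ \rho\,]$ and $[\,a\ \ b\ \ \theta\,]$.

From the mathematical definition of an ellipse, we have

$$\begin{aligned}
&\sqrt{(x-c\sin\theta)^{2}+(y-c\cos\theta)^{2}}+\sqrt{(x+c\sin\theta)^{2}+(y+c\cos\theta)^{2}}=2a\\
\Rightarrow\ & -2c\sin\theta\,x-2c\cos\theta\,y=4a^{2}-4a\sqrt{(x+c\sin\theta)^{2}+(y+c\cos\theta)^{2}}+2c\sin\theta\,x+2c\cos\theta\,y\\
\Rightarrow\ & a^{2}\left[(x+c\sin\theta)^{2}+(y+c\cos\theta)^{2}\right]=\left[a^{2}+c\sin\theta\,x+c\cos\theta\,y\right]^{2}\\
\Rightarrow\ & x^{2}+c^{2}\sin^{2}\theta+y^{2}+c^{2}\cos^{2}\theta=a^{2}+\frac{c^{2}}{a^{2}}\left[\sin^{2}\theta\,x^{2}+2\sin\theta\cos\theta\,xy+\cos^{2}\theta\,y^{2}\right]\\
\Rightarrow\ & \left(a^{2}-c^{2}\sin^{2}\theta\right)x^{2}-2c^{2}\sin\theta\cos\theta\,xy+\left(a^{2}-c^{2}\cos^{2}\theta\right)y^{2}=\text{constant } A.
\end{aligned} \tag{Eq. 3-9}$$

Regarding the exponent of the bivariate Gaussian function, we have

$$\frac{x^{2}}{\sigma_x^{2}}-\frac{2\rho\,xy}{\sigma_x\sigma_y}+\frac{y^{2}}{\sigma_y^{2}}=d^{2}\ \Rightarrow\ \sigma_y^{2}x^{2}-2\rho\,\sigma_x\sigma_y\,xy+\sigma_x^{2}y^{2}=\text{constant } B. \tag{Eq. 3-10}$$

(35) Comparing Eq. 3-9 and Eq. 3-10, the semi-major axis length, the semi-minor axis length, the offset angle and the linear eccentricity $c$ can be obtained:

$$\sigma_y^{2}=a^{2}-c^{2}\sin^{2}\theta,\quad \sigma_x^{2}=a^{2}-c^{2}\cos^{2}\theta\ \Rightarrow\ \sigma_y^{2}-\sigma_x^{2}=c^{2}\cos 2\theta, \tag{Eq. 3-11}$$

$$2\sigma_{xy}=2\rho\,\sigma_x\sigma_y=c^{2}\sin 2\theta, \tag{Eq. 3-12}$$

$$\Rightarrow\ \theta=\frac{1}{2}\sin^{-1}\frac{2\sigma_{xy}}{c^{2}}=\frac{1}{2}\cos^{-1}\frac{\sigma_y^{2}-\sigma_x^{2}}{c^{2}}. \tag{Eq. 3-13}$$

From Eq. 3-11 and Eq. 3-12,

$$c^{2}=\sqrt{\left(\sigma_y^{2}-\sigma_x^{2}\right)^{2}+4\sigma_{xy}^{2}}, \tag{Eq. 3-14}$$

$$\therefore\ a^{2}=\sigma_y^{2}+c^{2}\sin^{2}\theta=\sigma_x^{2}+c^{2}\cos^{2}\theta, \tag{Eq. 3-15}$$

$$b^{2}=a^{2}-c^{2}=\sigma_y^{2}-c^{2}\cos^{2}\theta=\sigma_x^{2}-c^{2}\sin^{2}\theta. \tag{Eq. 3-16}$$

Furthermore, the area of the ellipse is $\pi ab$. From Eq. 3-15 and Eq. 3-16, we have

$$\begin{aligned}
a^{2}b^{2} &= \left(\sigma_y^{2}+c^{2}\sin^{2}\theta\right)\left(\sigma_y^{2}-c^{2}\cos^{2}\theta\right)\\
&= \sigma_y^{4}-c^{2}\sigma_y^{2}\cos 2\theta-\frac{c^{4}}{4}\sin^{2}2\theta\\
&= \sigma_y^{4}-\sigma_y^{2}\left(\sigma_y^{2}-\sigma_x^{2}\right)-\sigma_{xy}^{2}\\
&= \det(V)
\end{aligned}$$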

(36)
$$\therefore\ \pi ab=\pi\sqrt{\det(V)}. \tag{Eq. 3-17}$$

Based on the above equations, we can mark the 2-sigma contour as a bounding ellipse to indicate the foreground region. Figure 3.10 shows the detection result of the Watson sequence, where $\mu=[\,408.97\ \ 273.04\,]$ and

$$V=\begin{bmatrix}1217.1 & -1019.7\\ -1019.7 & 6246.5\end{bmatrix}.$$

Figure 3.10 Using a bounding ellipse to define the moving object.
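The conversion from the covariance matrix to the ellipse parameters in Eq. 3-13 to Eq. 3-17 can be checked numerically with a short sketch. The function name is hypothetical, and `arctan2` is used here as an assumed quadrant-safe way of solving Eq. 3-11 and Eq. 3-12 for $\theta$ together. The returned axes describe the 1-sigma contour, so the 2-sigma bounding ellipse would use twice these lengths.

```python
import numpy as np

def ellipse_from_covariance(V):
    """Return (a, b, theta) of the 1-sigma ellipse of a 2x2 covariance matrix."""
    sxx, syy, sxy = V[0, 0], V[1, 1], V[0, 1]
    c2 = np.hypot(syy - sxx, 2.0 * sxy)              # Eq. 3-14: c^2
    theta = 0.5 * np.arctan2(2.0 * sxy, syy - sxx)   # Eq. 3-13, both branches at once
    a2 = syy + c2 * np.sin(theta) ** 2               # Eq. 3-15
    b2 = a2 - c2                                     # Eq. 3-16
    return np.sqrt(a2), np.sqrt(b2), theta

# Covariance matrix of the Watson detection result quoted in the text.
V = np.array([[1217.1, -1019.7],
              [-1019.7, 6246.5]])
a, b, theta = ellipse_from_covariance(V)
area_1sigma = np.pi * a * b                          # Eq. 3-17: pi * sqrt(det V)
```

The squared axis lengths coincide with the eigenvalues of $V$, which gives an independent check of the derivation.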

(37) 3.2.2 Multi-Blob Model

The color histogram is a robust representation of the color occurrence of an object. In [14], the spatiogram is considered as a generalized histogram that contains spatial information. Allocating pixels to color bins can be regarded as quantization in the feature space. However, there may be variations among the pixels belonging to the same color bin. Figure 3.11 shows two different patterns that have the same spatiogram values.

Figure 3.11 Two different patterns that cannot be differentiated by the spatiogram.

To preserve feature-domain information, we gather statistics in the feature space for each color bin. That is, we define the multi-blob model as

$$\text{Model}(d)=\langle n_d,\ \mu_{sd},\ V_{sd},\ \mu_{cd},\ V_{cd},\ T_d\rangle,\qquad d=1,\dots,B, \tag{Eq. 3-18}$$

where $n_d$ is the number of pixels in the $d$-th bin, $\mu_{sd}$ and $V_{sd}$ are the mean vector and the covariance matrix in the spatial domain, and $\mu_{cd}$ and $V_{cd}$ are the mean vector and the covariance matrix in the RGB color space. $B$ is the number of color bins. According to this model, the object pixels are separated into $B$ blobs. Nevertheless, the blobs should have different reliabilities. Thus, we add another parameter $T_d$ to represent the reliability of each blob. Here, $T_d$ is initialized to 1. We can update the reliabilities according to the situation during tracking.

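(38) To make the calculation more efficient, we introduce recursive formulas to gather the statistical measures $\mu_{sd}$, $V_{sd}$, $\mu_{cd}$ and $V_{cd}$. By definition,

$$\mu_{x,n}=\frac{\sum_{i=1}^{n}x_i}{n}\ \Rightarrow\ \mu_{x,n+1}=\frac{\sum_{i=1}^{n+1}x_i}{n+1}=\frac{\sum_{i=1}^{n}x_i}{n+1}+\frac{x_{n+1}}{n+1} \tag{Eq. 3-19}$$

$$=\frac{n}{n+1}\,\mu_{x,n}+\frac{x_{n+1}}{n+1}, \tag{Eq. 3-20}$$

$$\therefore\ \mu_{sd,n+1}=\frac{n}{n+1}\,\mu_{sd,n}+\frac{\mathbf{x}_{n+1}}{n+1};\qquad \mu_{cd,n+1}=\frac{n}{n+1}\,\mu_{cd,n}+\frac{\mathbf{c}_{n+1}}{n+1},$$

where $\mathbf{x}_n=[\,x_n\ \ y_n\,]$ and $\mathbf{c}_n=[\,r_n\ \ g_n\ \ b_n\,]$. For the variance,

$$\sigma_{x,n}^{2}=\frac{\sum_{i=1}^{n}\left(x_i-\mu_{x,n}\right)^{2}}{n}\ \Rightarrow\ \sigma_{x,n+1}^{2}=\frac{\sum_{i=1}^{n+1}\left(x_i-\mu_{x,n+1}\right)^{2}}{n+1}=\frac{\sum_{i=1}^{n}\left(x_i-\mu_{x,n+1}\right)^{2}+\left(x_{n+1}-\mu_{x,n+1}\right)^{2}}{n+1},$$

where

$$\begin{aligned}
\sum_{i=1}^{n}\left(x_i-\mu_{x,n+1}\right)^{2}-\sum_{i=1}^{n}\left(x_i-\mu_{x,n}\right)^{2}
&=\sum_{i=1}^{n}\left[\left(x_i-\mu_{x,n+1}\right)^{2}-\left(x_i-\mu_{x,n}\right)^{2}\right]\\
&=\sum_{i=1}^{n}\left[\left(\mu_{x,n}-\mu_{x,n+1}\right)\left(2x_i-\mu_{x,n+1}-\mu_{x,n}\right)\right]\\
&=\left(\mu_{x,n}-\mu_{x,n+1}\right)\left[\sum_{i=1}^{n}x_i-n\,\mu_{x,n+1}\right].
\end{aligned} \tag{Eq. 3-21}$$

Based on Eq. 3-20, Eq. 3-21 becomes

$$\sum_{i=1}^{n}\left(x_i-\mu_{x,n+1}\right)^{2}-\sum_{i=1}^{n}\left(x_i-\mu_{x,n}\right)^{2}=\frac{1}{n}\left(\mu_{x,n+1}-x_{n+1}\right)^{2}.$$

Hence,

$$\begin{aligned}
\sigma_{x,n+1}^{2}&=\frac{\sum_{i=1}^{n}\left(x_i-\mu_{x,n}\right)^{2}+\frac{1}{n}\left(x_{n+1}-\mu_{x,n+1}\right)^{2}+\left(x_{n+1}-\mu_{x,n+1}\right)^{2}}{n+1}\\
&=\frac{n}{n+1}\,\sigma_{x,n}^{2}+\frac{1}{n}\left(x_{n+1}-\mu_{x,n+1}\right)^{2},\qquad\text{where } \sigma_{1}^{2}=0.
\end{aligned}$$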

(39) Similarly,

$$\sigma_{xy,n+1}=\frac{n}{n+1}\,\sigma_{xy,n}+\frac{1}{n}\left(x_{n+1}-\mu_{x,n+1}\right)\left(y_{n+1}-\mu_{y,n+1}\right).$$

Thus, we have

$$V_{sd,n+1}=\frac{n}{n+1}\,V_{sd,n}+\frac{1}{n}\left(\mathbf{x}_{n+1}-\mu_{sd,n+1}\right)^{T}\left(\mathbf{x}_{n+1}-\mu_{sd,n+1}\right),\quad\text{where } V_{sd,1}=\begin{bmatrix}1&0\\0&1\end{bmatrix}, \tag{Eq. 3-22}$$

$$V_{cd,n+1}=\frac{n}{n+1}\,V_{cd,n}+\frac{1}{n}\left(\mathbf{c}_{n+1}-\mu_{cd,n+1}\right)^{T}\left(\mathbf{c}_{n+1}-\mu_{cd,n+1}\right),\quad\text{where } V_{cd,1}=\begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix}. \tag{Eq. 3-23}$$

We set the diagonal terms of the initial value to 1 to ensure that the covariance matrices are invertible. Figure 3.12 shows the multi-blob model built from the detection result shown in Figure 3.10. In the RGB color space, we use 4 bins per channel to build the model. Separating each channel into too many bins may lose the correlation between the feature domain and the spatial domain. We rank the blobs according to $n_d$, and mark the top 5 blobs with their mean values in the RGB color space.

Figure 3.12 The five blobs that appear most frequently are marked with their mean values in the RGB color space.

Based on the multi-blob model, we roughly know how the colors are distributed in the bounding ellipse in both the spatial domain and the feature domain. The knowledge of the relative positions among these blobs can increase the tracking accuracy.
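The recursive updates of Eq. 3-20, Eq. 3-22 and Eq. 3-23 can be sketched as below. The function name `running_stats` and the optional argument `v1` are assumptions for illustration; `v1` exposes the initial covariance so that the recursion can be compared against a batch estimate. Samples are treated as row vectors, as in the text.

```python
import numpy as np

def running_stats(samples, v1=None):
    """Recursive mean and covariance over the rows of `samples`.

    v1 is the initial covariance V_1; the text uses the identity so that the
    matrix stays invertible, while zeros reproduce the batch (1/n) covariance.
    """
    samples = np.asarray(samples, dtype=float)
    dim = samples.shape[1]
    mu = samples[0].copy()
    V = np.eye(dim) if v1 is None else np.array(v1, dtype=float)
    for n, x in enumerate(samples[1:], start=1):   # n samples already absorbed
        mu = (n * mu + x) / (n + 1)                # Eq. 3-20
        d = (x - mu)[None, :]                      # row vector x_{n+1} - mu_{n+1}
        V = n / (n + 1) * V + (d.T @ d) / n        # Eq. 3-22 / Eq. 3-23
    return mu, V

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))                   # e.g. RGB samples of one blob
mu_rec, V_rec = running_stats(data, v1=np.zeros((3, 3)))
mu_id, V_id = running_stats(data)                  # identity initial value
```

With a zero initial value the recursion reproduces the batch covariance. With the identity initial value, the recursion leaves an extra term $I/n$ after $n$ samples, which keeps the matrix invertible for small blobs and fades as the blob grows.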

(40) 3.2.3 Similarity Measure

We do not define the similarity between two multi-blob models; instead, we define the similarity between a region of the image frame and a multi-blob model. A mean-shift process derived from the similarity between two multi-blob models would have to build a candidate model at every iteration, which would cause information loss and increase the computational complexity. In Eq. 2-10, the similarity value is accumulated from the contribution of each pixel: with respect to the model, a closer pixel with a more similar color contributes a larger similarity. We follow this idea and define our similarity measure based on a multi-blob model. Let $y$ be the center of the target ellipse and $\mu$ be the center of the model ellipse. Then, we define the similarity $J$ as

$$J(y)=\frac{1}{N}\sum_{i\in I}\sum_{d=1}^{B} w\bigl((y_i-y)-(\mu_{sd}-\mu),\,V_{sd}\bigr)\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr), \tag{Eq. 3-24}$$

where $w(x,V)=\exp\!\left(-\dfrac{x\,V^{-1}x^{T}}{2}\right)$ is a multivariate Gaussian kernel, $I=\{y_i,c_i\}$ is the set of samples enclosed by the target ellipse, and $N$ is the number of samples. Each pixel is weighted by the Gaussian kernels in both the spatial domain and the feature space and then multiplied by the reliability coefficient. The definition can be regarded as the mean of these weights.

Again, as an example, we detect the moving object and build a multi-blob model using Frame 15 of the Hans sequence. We apply the same bounding ellipse to Frame 250 and calculate the similarity surface around the object. Then, we find the location $y$ with the maximal similarity value. Figure 3.13 shows that the similarity measure is now more discriminative than the Bhattacharyya coefficient shown in Figure 3.6.

The similarity measure can be generalized to

$$J(y)=\frac{1}{N}\sum_{i\in I}\sum_{d=1}^{B} w\bigl((y_i-y)-(\mu_{sd}-\mu),\,V_{sd}\bigr)^{\alpha}\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)^{\beta}\,T_d\,\delta\bigl(b(y_i)-d\bigr), \tag{Eq. 3-25}$$

where $\alpha$ and $\beta$ represent the dominance of the two Gaussian kernels. The spatial Gaussian kernel dominates if $\alpha\gg\beta$. If both $\alpha$ and $\beta$ are much smaller than 1, the similarity becomes

$$J(y)\cong\frac{1}{N}\sum_{i\in I}\sum_{d=1}^{B} T_d\,\delta\bigl(b(y_i)-d\bigr), \tag{Eq. 3-26}$$

which corresponds to the average reliability of the samples enclosed by the bounding ellipse. For simplicity, we choose $\alpha=\beta=1$.
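A minimal sketch of the similarity in Eq. 3-24 and Eq. 3-25 is given below. The dictionary layout of the model, the function names, and the toy data are assumptions made for illustration. The delta function is realized by indexing each sample with its own bin.

```python
import numpy as np

def gauss(x, V_inv):
    """w(x, V) = exp(-x V^{-1} x^T / 2), evaluated for every row of x."""
    return np.exp(-0.5 * np.einsum('ni,nij,nj->n', x, V_inv, x))

def similarity(y, samples, colors, bins, model, alpha=1.0, beta=1.0):
    """J(y) of Eq. 3-25; alpha = beta = 1 gives Eq. 3-24."""
    d = bins                                       # blob index of each sample
    dy = (samples - y) - (model['mu_s'][d] - model['mu'])
    dc = colors - model['mu_c'][d]
    ws = gauss(dy, np.linalg.inv(model['V_s'])[d])
    wc = gauss(dc, np.linalg.inv(model['V_c'])[d])
    return float(np.mean(ws ** alpha * wc ** beta * model['T'][d]))

# Toy model with two blobs (hypothetical numbers).
model = {
    'mu':   np.array([10.0, 10.0]),                          # model ellipse centre
    'mu_s': np.array([[8.0, 10.0], [12.0, 10.0]]),           # spatial means
    'V_s':  np.stack([4.0 * np.eye(2)] * 2),                 # spatial covariances
    'mu_c': np.array([[255.0, 0.0, 0.0], [0.0, 0.0, 255.0]]),
    'V_c':  np.stack([100.0 * np.eye(3)] * 2),
    'T':    np.ones(2),                                      # reliabilities
}
shift = np.array([5.0, 3.0])              # the object moved by this translation
samples = model['mu_s'] + shift           # one sample per blob, at the blob mean
colors = model['mu_c'].copy()
bins = np.array([0, 1])

j_at_target = similarity(model['mu'] + shift, samples, colors, bins, model)
j_at_start = similarity(model['mu'], samples, colors, bins, model)
```

In this toy case the similarity reaches its maximum at the translated centre, where every sample sits exactly at its blob mean, and it drops at the old centre.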

(41) (a) (b) (c)
Figure 3.13 (a) The moving object detected in Frame 15 of the Hans sequence. A multi-blob model is built according to the red bounding ellipse. (b) The region with the maximal similarity to the model in Frame 250 of the Hans sequence. (c) The similarity surface, where $y$ is the center of the bounding ellipse in (b).

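(42) 3.2.4 Mean-Shift Tracking Procedure

The distance function is defined as

$$L(y)=-\log J(y). \tag{Eq. 3-27}$$

The gradient of the distance function with respect to the vector $y$ is

$$\nabla L(y)=-\frac{\nabla J(y)}{J(y)}, \tag{Eq. 3-28}$$

where

$$\begin{aligned}
\nabla J(y)&=\frac{1}{N}\sum_{i\in I}\sum_{d=1}^{B}\nabla w\bigl((y_i-y)-(\mu_{sd}-\mu),\,V_{sd}\bigr)\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)\\
&=\frac{1}{N}\sum_{i\in I}\sum_{d=1}^{B}\bigl[(\mu_{sd}-\mu)-(y_i-y)\bigr]\,g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr),
\end{aligned}$$

and $g(\cdot)=-w'(\cdot)$ is the shadow kernel, which is also a multivariate Gaussian profile in our case, with $\Delta y=(y_i-y)$, $\Delta\mu=(\mu_{sd}-\mu)$ and $\Delta c=(c_i-\mu_{cd})$. Since the shadow kernel of a Gaussian is again Gaussian, $J(y)$ in the denominator can be written with $g$ in place of $w$ up to a constant factor. Thus, Eq. 3-28 becomes

$$\begin{aligned}
\nabla L(y)&=\frac{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}\bigl[(y_i-y)-(\mu_{sd}-\mu)\bigr]\,g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}\\[4pt]
&=\frac{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}\bigl[y_i-\mu_{sd}\bigr]\,g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}-y+\mu.
\end{aligned} \tag{Eq. 3-29}$$

By setting Eq. 3-29 to 0, we have

$$y_1=\frac{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}\bigl[y_i-\mu_{sd}\bigr]\,g\bigl(\Delta y_0-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}g\bigl(\Delta y_0-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}+\mu, \tag{Eq. 3-30}$$

where $\Delta y_0=y_i-y_0$, and an iterative mean-shift formula is obtained.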

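(43) Alternatively, by letting $\nabla J(y)=0$, we have

$$\begin{aligned}
&\sum_{i\in I}\sum_{d=1}^{B}\bigl[(\mu_{sd}-\mu)-(y_i-y)\bigr]\,g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)=0\\
\Rightarrow\ &(y-\mu)\sum_{i\in I}\sum_{d=1}^{B}g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)\\
&\qquad=\sum_{i\in I}\sum_{d=1}^{B}(y_i-\mu_{sd})\,g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)\\
\Rightarrow\ &y_1=\frac{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}\bigl[y_i-\mu_{sd}\bigr]\,g\bigl(\Delta y_0-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}g\bigl(\Delta y_0-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(\Delta c,\,V_{cd}\bigr)\,T_d\,\delta\bigl(b(y_i)-d\bigr)}+\mu,
\end{aligned}$$

which is the same as Eq. 3-30. For the generalized similarity measure expressed in Eq. 3-25, we have

$$\begin{aligned}
\nabla J(y)=\frac{1}{N}\sum_{i\in I}\sum_{d=1}^{B}&\bigl[(\mu_{sd}-\mu)-(y_i-y)\bigr]\,\alpha\,w\bigl((y_i-y)-(\mu_{sd}-\mu),\,V_{sd}\bigr)^{\alpha-1}\\
&\times g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)^{\beta}\,T_d\,\delta\bigl(b(y_i)-d\bigr),
\end{aligned}$$

where $g(\cdot)=-w'(\cdot)$ is also a multivariate Gaussian profile in our case. Therefore,

$$\begin{aligned}
&\nabla J(y)=0\\
\Rightarrow\ &\sum_{i\in I}\sum_{d=1}^{B}\bigl[(\mu_{sd}-\mu)-(y_i-y)\bigr]\,g\bigl(\Delta y-\Delta\mu,\,V_{sd}\bigr)^{\alpha}\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)^{\beta}\,T_d\,\delta\bigl(b(y_i)-d\bigr)=0\\
\Rightarrow\ &y_1=\frac{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}\bigl[y_i-\mu_{sd}\bigr]\,g\bigl(\Delta y_0-\Delta\mu,\,V_{sd}\bigr)^{\alpha}\,w\bigl(\Delta c,\,V_{cd}\bigr)^{\beta}\,T_d\,\delta\bigl(b(y_i)-d\bigr)}{\displaystyle\sum_{i\in I}\sum_{d=1}^{B}g\bigl(\Delta y_0-\Delta\mu,\,V_{sd}\bigr)^{\alpha}\,w\bigl(\Delta c,\,V_{cd}\bigr)^{\beta}\,T_d\,\delta\bigl(b(y_i)-d\bigr)}+\mu,
\end{aligned} \tag{Eq. 3-31}$$

which is the generalized mean-shift equation.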
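One update of the generalized mean-shift equation (Eq. 3-31, which reduces to Eq. 3-30 for $\alpha=\beta=1$) can be sketched as follows. The model layout and names are illustrative assumptions, and the sketch keeps the sample set fixed, whereas the tracker would gather the samples enclosed by the ellipse at each new centre.

```python
import numpy as np

def gauss(x, V_inv):
    """exp(-x V^{-1} x^T / 2) per row; for a Gaussian, g is proportional to w."""
    return np.exp(-0.5 * np.einsum('ni,nij,nj->n', x, V_inv, x))

def blob_mean_shift_step(y0, samples, colors, bins, model, alpha=1.0, beta=1.0):
    """One update y0 -> y1 of Eq. 3-31."""
    d = bins
    dy = (samples - y0) - (model['mu_s'][d] - model['mu'])   # Delta y0 - Delta mu
    dc = colors - model['mu_c'][d]                           # Delta c
    g = gauss(dy, np.linalg.inv(model['V_s'])[d]) ** alpha
    wc = gauss(dc, np.linalg.inv(model['V_c'])[d]) ** beta
    k = g * wc * model['T'][d]                               # per-sample weight
    return k @ (samples - model['mu_s'][d]) / k.sum() + model['mu']

# Toy model with two blobs (hypothetical numbers).
model = {
    'mu':   np.array([10.0, 10.0]),
    'mu_s': np.array([[8.0, 10.0], [12.0, 10.0]]),
    'V_s':  np.stack([4.0 * np.eye(2)] * 2),
    'mu_c': np.array([[255.0, 0.0, 0.0], [0.0, 0.0, 255.0]]),
    'V_c':  np.stack([100.0 * np.eye(3)] * 2),
    'T':    np.ones(2),
}
shift = np.array([5.0, 3.0])
samples = model['mu_s'] + shift            # the object translated rigidly
colors = model['mu_c'].copy()
bins = np.array([0, 1])
y1 = blob_mean_shift_step(model['mu'], samples, colors, bins, model)
```

Because each sample is compared with the mean of its own blob, a rigid translation of the object is recovered in a single step in this idealized case. No candidate model has to be rebuilt during the iteration.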

(44) 3.2.5 Updating Size and Orientation

To update the size and orientation of the bounding ellipse, we refer to the region around the bounding ellipse. The bounding ellipse is defined as the 2-sigma contour. Here, we consider the band between the 3-sigma contour and the 2-sigma contour as the background field. Figure 3.14 indicates the foreground region and the background region based on a bounding ellipse.

Figure 3.14 The foreground region and the background region based on a bounding ellipse.

To modify the bounding ellipse, we first modify each color blob. We re-compute $\mu_{sd}$ and $V_{sd}$ of each blob by weighting the samples belonging to $I_B$ and $I_F$. Again, for computational efficiency, the formula must be recursive. Let

$$q_i=w\bigl((y_i-y)-(\mu_{sd}-\mu),\,V_{sd}\bigr)^{\alpha}\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)^{\beta}$$

be the weight of the $i$-th sample $y_i=(x_i,y_i)$, where $d=b(y_i)$. Then

$$\begin{aligned}
\mu_{x,n}=\frac{\sum_{i=1}^{n}q_i\,x_i}{\sum_{i=1}^{n}q_i}\ \Rightarrow\ \mu_{x,n+1}&=\frac{\sum_{i=1}^{n+1}q_i\,x_i}{\sum_{i=1}^{n+1}q_i}=\frac{\sum_{i=1}^{n}q_i\,x_i}{\sum_{i=1}^{n+1}q_i}+\frac{q_{n+1}\,x_{n+1}}{\sum_{i=1}^{n+1}q_i}\\
&=\frac{Q_n}{Q_n+q_{n+1}}\,\mu_{x,n}+\frac{q_{n+1}}{Q_n+q_{n+1}}\,x_{n+1},
\end{aligned}$$

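(45) where $Q_n=\sum_{i=1}^{n}q_i$ and $\mu_0=0$. Similarly,

$$\mu_{y,n+1}=\frac{Q_n}{Q_n+q_{n+1}}\,\mu_{y,n}+\frac{q_{n+1}}{Q_n+q_{n+1}}\,y_{n+1},$$

$$\therefore\ \mu_{n+1}=\frac{Q_n}{Q_n+q_{n+1}}\,\mu_{n}+\frac{q_{n+1}}{Q_n+q_{n+1}}\,\mathbf{y}_{n+1}. \tag{Eq. 3-32}$$

For the variance,

$$\sigma_{x,n}^{2}=\frac{\sum_{i=1}^{n}q_i\left(x_i-\mu_{x,n}\right)^{2}}{\sum_{i=1}^{n}q_i}\ \Rightarrow\ \sigma_{x,n+1}^{2}=\frac{\sum_{i=1}^{n+1}q_i\left(x_i-\mu_{x,n+1}\right)^{2}}{\sum_{i=1}^{n+1}q_i}=\frac{\sum_{i=1}^{n}q_i\left(x_i-\mu_{x,n+1}\right)^{2}+q_{n+1}\left(x_{n+1}-\mu_{x,n+1}\right)^{2}}{Q_n+q_{n+1}},$$

where

$$\begin{aligned}
\sum_{i=1}^{n}q_i\left(x_i-\mu_{x,n+1}\right)^{2}-\sum_{i=1}^{n}q_i\left(x_i-\mu_{x,n}\right)^{2}
&=\sum_{i=1}^{n}q_i\left[\left(\mu_{x,n}-\mu_{x,n+1}\right)\left(2x_i-\mu_{x,n+1}-\mu_{x,n}\right)\right]\\
&=\left(\mu_{x,n}-\mu_{x,n+1}\right)\left[\sum_{i=1}^{n}q_i\,x_i-\mu_{x,n+1}\,Q_n\right]\\
&=\frac{q_{n+1}^{2}}{Q_n}\left(\mu_{x,n+1}-x_{n+1}\right)^{2}.
\end{aligned}$$

$$\begin{aligned}
\therefore\ \sigma_{x,n+1}^{2}&=\frac{\sum_{i=1}^{n}q_i\left(x_i-\mu_{x,n}\right)^{2}+\dfrac{q_{n+1}^{2}}{Q_n}\left(x_{n+1}-\mu_{x,n+1}\right)^{2}+q_{n+1}\left(x_{n+1}-\mu_{x,n+1}\right)^{2}}{Q_n+q_{n+1}}\\
&=\frac{Q_n}{Q_n+q_{n+1}}\,\sigma_{x,n}^{2}+\frac{q_{n+1}}{Q_n}\left(x_{n+1}-\mu_{x,n+1}\right)^{2},
\end{aligned}$$

where $\sigma_{x,0}^{2}=\sigma_{x,1}^{2}=0$. Similarly,

$$\sigma_{y,n+1}^{2}=\frac{Q_n}{Q_n+q_{n+1}}\,\sigma_{y,n}^{2}+\frac{q_{n+1}}{Q_n}\left(y_{n+1}-\mu_{y,n+1}\right)^{2},$$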

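(46)
$$\sigma_{xy,n+1}=\frac{Q_n}{Q_n+q_{n+1}}\,\sigma_{xy,n}+\frac{q_{n+1}}{Q_n}\left(x_{n+1}-\mu_{x,n+1}\right)\left(y_{n+1}-\mu_{y,n+1}\right),$$

$$\therefore\ V_{n+1}=\frac{Q_n}{Q_n+q_{n+1}}\,V_{n}+\frac{q_{n+1}}{Q_n}\left(\mathbf{y}_{n+1}-\mu_{n+1}\right)^{T}\left(\mathbf{y}_{n+1}-\mu_{n+1}\right). \tag{Eq. 3-33}$$

Based on Eq. 3-32 and Eq. 3-33, each blob can be updated. To obtain the new bounding ellipse, we have to merge all the blobs together. By definition,

$$\mu_{sd}=\frac{\sum_{i=1}^{n_d}y_i}{n_d}\ \Rightarrow\ \mu=\frac{\sum_{j=1}^{n}y_j}{n}=\frac{\sum_{d=1}^{B}\sum_{i=1}^{n_d}y_i}{\sum_{d=1}^{B}n_d}=\frac{\sum_{d=1}^{B}n_d\,\mu_{sd}}{\sum_{d=1}^{B}n_d}. \tag{Eq. 3-34}$$

$$\sigma_{xd}^{2}=\frac{\sum_{i=1}^{n_d}x_i^{2}}{n_d}-\mu_{xd}^{2}\ \Rightarrow\ \sum_{i=1}^{n_d}x_i^{2}=n_d\left(\sigma_{xd}^{2}+\mu_{xd}^{2}\right),$$

$$\sigma_{x}^{2}=\frac{\sum_{j=1}^{n}x_j^{2}}{n}-\mu_x^{2}=\frac{\sum_{d=1}^{B}\sum_{i=1}^{n_d}x_i^{2}}{\sum_{d=1}^{B}n_d}-\mu_x^{2}=\frac{\sum_{d=1}^{B}n_d\left(\sigma_{xd}^{2}+\mu_{xd}^{2}\right)}{\sum_{d=1}^{B}n_d}-\mu_x^{2}.$$

Similarly,

$$\sigma_{y}^{2}=\frac{\sum_{d=1}^{B}n_d\left(\sigma_{yd}^{2}+\mu_{yd}^{2}\right)}{\sum_{d=1}^{B}n_d}-\mu_y^{2},\qquad
\sigma_{xy}=\frac{\sum_{d=1}^{B}n_d\left(\sigma_{xyd}+\mu_{xd}\,\mu_{yd}\right)}{\sum_{d=1}^{B}n_d}-\mu_x\,\mu_y,$$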

(47)
$$\therefore\ V=\frac{\sum_{d=1}^{B}n_d\left(V_{sd}+\mu_{sd}^{T}\mu_{sd}\right)}{\sum_{d=1}^{B}n_d}-\mu^{T}\mu. \tag{Eq. 3-35}$$

From Eq. 3-34 and Eq. 3-35, the overall mean vector and covariance matrix can be regarded as a weighted combination of the blobs with weights $n_d$. To take the reliability into account, we modify Eq. 3-34 and Eq. 3-35 to be

$$\mu=\frac{\sum_{d=1}^{B}n_d\,T_d^{\gamma}\,\mu_{sd}}{\sum_{d=1}^{B}n_d\,T_d^{\gamma}};\qquad
V=\frac{\sum_{d=1}^{B}n_d\,T_d^{\gamma}\left(V_{sd}+\mu_{sd}^{T}\mu_{sd}\right)}{\sum_{d=1}^{B}n_d\,T_d^{\gamma}}-\mu^{T}\mu, \tag{Eq. 3-36}$$

where $\gamma$ represents the dominance of the reliability. A large $\gamma$ causes the mean vector and the covariance matrix to be dominated by the blobs with high reliability.

When a Gaussian profile is used to estimate a distribution $f(x,y)$, the variance should be obtained from $\sigma_x^{2}=\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}(x-\mu)^{2}f(x,y)\,dx\,dy$, where $\mu=\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}x\,f(x,y)\,dx\,dy$. In our case, the samples are weighted by $q_i=w\bigl((y_i-y)-(\mu_{sd}-\mu),\,V_{sd}\bigr)^{\alpha}\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)^{\beta}$. Hence, the variance should be

$$\sigma_x^{2}=\frac{1}{2\pi|V|^{1/2}}\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}(x-\mu_x)^{2}\,w\bigl([\,x\ \ y\,]-[\,\mu_x\ \ \mu_y\,],\,V\bigr)^{\alpha}\,w\bigl(c_i-\mu_{cd},\,V_{cd}\bigr)^{\beta}\,dx\,dy, \tag{Eq. 3-37}$$

where $w(c_i-\mu_{cd},V_{cd})$ is a Gaussian random variable independent of $x$ and $y$. Due to this independence, $w(c_i-\mu_{cd},V_{cd})$ does not affect the variance. However, the above procedure only considers the region enclosed by the $t$-sigma contour. Hence, the covariance matrix obtained by Eq. 3-36 is smaller than the true one.

Let the true covariance matrix be $\hat{V}=\begin{bmatrix}\sigma_x^{2}&\sigma_{xy}\\ \sigma_{xy}&\sigma_y^{2}\end{bmatrix}=f(\alpha,t)\times V$, where $V$ is the matrix obtained by Eq. 3-36. Since $f(\alpha,t)$ is independent of the orientation, we assume $\sigma_{xy}=0$ and $\mu=[\,0\ \ 0\,]$ for simplicity.

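(48) Let $\tilde{\sigma}_x^{2}$ denote the variance measured inside the truncated region, while $\sigma_x$ and $\sigma_y$ are the true standard deviations. Then

$$\tilde{\sigma}_x^{2}=\frac{1}{2\pi\sigma_x\sigma_y}\iint_{I}x^{2}\,e^{-\frac{\alpha x^{2}}{2\sigma_x^{2}}}\,e^{-\frac{\alpha y^{2}}{2\sigma_y^{2}}}\,dx\,dy, \tag{Eq. 3-38}$$

where $I$ is the region enclosed by the $t$-sigma contour. By letting $X=\dfrac{\sqrt{\alpha}\,x}{\sigma_x}$ and $Y=\dfrac{\sqrt{\alpha}\,y}{\sigma_y}$, Eq. 3-38 becomes

$$\tilde{\sigma}_x^{2}=\frac{\sigma_x^{2}}{2\pi\alpha^{2}}\iint_{I}X^{2}\,e^{-\frac{X^{2}}{2}}\,e^{-\frac{Y^{2}}{2}}\,dX\,dY.$$

Letting $X=R\cos\theta$ and $Y=R\sin\theta$, we have

$$\tilde{\sigma}_x^{2}=\frac{\sigma_x^{2}}{2\pi\alpha^{2}}\int_{0}^{2\pi}\!\!\int_{0}^{\sqrt{\alpha}\,t}R^{3}\cos^{2}\theta\,e^{-\frac{R^{2}}{2}}\,dR\,d\theta
=\frac{\sigma_x^{2}}{2\pi\alpha^{2}}\int_{0}^{2\pi}\cos^{2}\theta\,d\theta\int_{0}^{\sqrt{\alpha}\,t}R^{3}\,e^{-\frac{R^{2}}{2}}\,dR.$$

Using integration by parts, we let $u=R^{2}$ and $dv=R\,e^{-\frac{R^{2}}{2}}\,dR$, so that $du=2R\,dR$ and $v=-e^{-\frac{R^{2}}{2}}$. Then

$$\begin{aligned}
\tilde{\sigma}_x^{2}&=\frac{\sigma_x^{2}}{2\pi\alpha^{2}}\int_{0}^{2\pi}\frac{\cos 2\theta+1}{2}\,d\theta\left[-R^{2}e^{-\frac{R^{2}}{2}}\Big|_{0}^{\sqrt{\alpha}\,t}+2\int_{0}^{\sqrt{\alpha}\,t}R\,e^{-\frac{R^{2}}{2}}\,dR\right]\\
&=\frac{\sigma_x^{2}}{2\alpha^{2}}\left[-\alpha t^{2}e^{-\frac{\alpha t^{2}}{2}}-2e^{-\frac{R^{2}}{2}}\Big|_{0}^{\sqrt{\alpha}\,t}\right]\\
&=\frac{\sigma_x^{2}}{\alpha^{2}}\left[1-e^{-\frac{\alpha t^{2}}{2}}-\frac{\alpha t^{2}}{2}\,e^{-\frac{\alpha t^{2}}{2}}\right].
\end{aligned} \tag{Eq. 3-39}$$

Therefore, the new covariance matrix obtained by Eq. 3-36 should be multiplied by the factor $f(\alpha,t)=\alpha^{2}\left(1-e^{-\frac{\alpha t^{2}}{2}}-\dfrac{\alpha t^{2}}{2}\,e^{-\frac{\alpha t^{2}}{2}}\right)^{-1}$. That is,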

(49) B. μ=. ∑ nd Td μ sd γ. d =1 B. ∑n T d =1. d d. γ. ⎡ B ⎤ γ T ⎢ ∑ nd Td (Vsd + μ sd μ sd ) ⎥ T d =1 ⎢ ⎥ f (α , t ) . = − V μ μ ; B γ ⎢ ⎥ nd Td ∑ ⎢⎣ ⎥⎦ d =1. Eq. 3-40. For t = 2 , α = 1 , f (α , t ) = 1.6835. In Figure 3.14, the bounding ellipse is moved to the position with the maximal similarity. The motion of bounding ellipse is pure translation. Applying this updating process after the tracking process will make the bounding ellipse better fit the moving object. We set the γ in Eq. 3-40 to be 0.3. Figure 3.15 shows the result of the size and orientation updating.. Figure 3.15 Result of size and orientation updating. The blue ellipses represent the 2-sigma contour and the 3-sigma contour with respect to the original covariance matrix. The red ellipse represents the 2-sigma contour with respect to the modified covariance matrix.. 40.

(50) 3.2.6 Updating Reliability

To avoid distraction during tracking, colors that appear in both the foreground field and the background field should have lower reliability. Furthermore, if a color blob represents background pixels, its position relative to the center of the bounding ellipse will vary during the tracking process. Based on these ideas, we propose two strategies to update the reliability coefficient.

Similar to Figure 3.14, we define the background field and the foreground field as shown in Figure 3.16(a). We let $H_f(d)$ be the color histogram of the pixels in the foreground field and $H_b(d)$ be the color histogram of the pixels in the background field.

(a) (b)
Figure 3.16 (a) The outer ellipse represents the 3-sigma contour, and the inner ellipse represents the 2-sigma contour with respect to the modified covariance matrix. (b) Color histograms of the background field and the foreground field.

Define

$$T_s(d)=\frac{H_f(d)}{\max\{H_b(d),\,\delta\}},$$

where $\delta$ is a small value, say 0.001, that prevents division by zero. Blobs with a small $T_s(d)$ value may get distracted by the background. At the same time, we use the samples of the foreground field to build a multi-blob model as the target candidate. The target candidate is denoted as

(51)
$$\text{Candidate}(d)=\langle n'_d,\ \mu'_{sd},\ V'_{sd},\ \mu'_{cd},\ V'_{cd},\ T'_d\rangle,\qquad d=1,\dots,B. \tag{Eq. 3-41}$$

To measure the movement of a blob relative to the center, we define

$$\psi_d=\left(\Delta\mu'_d-\Delta\mu_d\right)\hat{V}_{sd}^{-1}\left(\Delta\mu'_d-\Delta\mu_d\right)^{T} \tag{Eq. 3-42}$$

as the translation, where $\Delta\mu'_d=\mu'_{sd}-y_1$, $\Delta\mu_d=\mu_{sd}-\mu$ and $\hat{V}_{sd}^{-1}=\dfrac{1}{2}\left(V'^{-1}_{sd}+V_{sd}^{-1}\right)$. Notice that $\psi_d$ is the average of two Mahalanobis distances between $\Delta\mu'_d$ and $\Delta\mu_d$, one measured with $V'_{sd}$ and the other with $V_{sd}$. The reliability of the $d$-th blob is allowed to increase if $\psi_d$ lies within the 1-sigma contour.

The reliability should be bounded to prevent the tracking behavior from being dominated by certain blobs. On the other hand, the reliability should spread widely to discriminate between blobs. Thus, we define the reliability function as

$$T(X)=\frac{1+aX^{2}}{a+X^{2}}, \tag{Eq. 3-43}$$

where $a$ represents the discrimination between blobs. Blobs with low reliability can be ignored if $a$ is large enough.

Figure 3.17 Reliability function. Both the x-axis and the y-axis are in log scale.

(52) From $T_s(d)$, $\psi_d$ and the reliability function, we develop our reliability updating strategy. There are five steps in updating the reliability.

Step 1. $T_d=\dfrac{1+aX^{2}}{a+X^{2}}\ \Rightarrow\ X_0=\sqrt{\dfrac{aT_d-1}{a-T_d}}$.

Step 2. $X_1=X_0\,e^{\left(1-\psi_d^{2}\right)/2}$.

Step 3.
Case 1: $T_s(d)\ge 3$, $X_2=X_1\times 1.5$.
Case 2: $2\ge T_s(d)>1$, $X_2=\dfrac{X_1}{1.5}$.
Case 3: $1\ge T_s(d)$, $X_2=\dfrac{X_1}{3}$.

Step 4. $X=\max\!\left(\min\left(10a,\,X_2\right),\ \dfrac{1}{10a}\right)$.

Step 5. $T_{d\_\mathrm{new}}=\dfrac{1+aX^{2}}{a+X^{2}}$.

In Step 4, $X$ is bounded to prevent $T_d=a$, which would cause a division by zero in Step 1. Another reason is to keep the reliability sensitive to changes. We choose $a=20$ in practice. In Figure 3.18, pixels with high reliability are marked in white, whereas pixels with low reliability are marked in black. Pixels that keep their original color have medium reliability. Based on the reliability map, the moving object can be roughly identified.

Figure 3.18 The reliability map, which can roughly identify the moving object.
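The five steps above can be sketched as follows. The sketch assumes that Eq. 3-43 reads $T(X)=(1+aX^{2})/(a+X^{2})$, so that Step 1 inverts it with a square root. The text gives no rule for $2<T_s(d)<3$; leaving $X$ unchanged in that range is an assumption of this sketch, as are the function names.

```python
import math

def reliability(x, a=20.0):
    """Reliability function T(X) = (1 + a X^2) / (a + X^2)."""
    return (1.0 + a * x * x) / (a + x * x)

def update_reliability(T_d, psi_d, Ts_d, a=20.0):
    """Return the new reliability of one blob; requires 1/a <= T_d < a."""
    x0 = math.sqrt((a * T_d - 1.0) / (a - T_d))        # Step 1: invert T(X)
    x1 = x0 * math.exp((1.0 - psi_d ** 2) / 2.0)       # Step 2: blob displacement
    if Ts_d >= 3.0:                                    # Step 3: colour contrast
        x2 = x1 * 1.5
    elif 1.0 < Ts_d <= 2.0:
        x2 = x1 / 1.5
    elif Ts_d <= 1.0:
        x2 = x1 / 3.0
    else:
        x2 = x1            # assumption: 2 < Ts < 3 is not specified in the text
    x = max(min(10.0 * a, x2), 1.0 / (10.0 * a))       # Step 4: bound X
    return reliability(x, a)                           # Step 5

t_up = update_reliability(T_d=1.0, psi_d=1.0, Ts_d=3.0)     # foreground-only colour
t_down = update_reliability(T_d=1.0, psi_d=1.0, Ts_d=0.5)   # colour shared with background
t_cap = update_reliability(T_d=19.99, psi_d=0.0, Ts_d=5.0)  # already near the upper bound
```

Because of the bound in Step 4, the updated reliability always stays strictly between $1/a$ and $a$, so Step 1 remains well defined in the next frame.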

3.2.7 Updating Target Model

Since a target candidate is already obtained when we update the reliability, all that remains is to decide when to update the model. The reliability maps of two frames in the Watson sequence are shown in Figure 3.19. In the situation of Figure 3.19(a), the moment is clearly not appropriate for updating the target model, because the unexpected background information may cause tracking failure. In Figure 3.19(b), the bounding ellipse tracks the moving object well, and the background has low reliability. This is a suitable moment to update the target model.

Figure 3.19 Frames with pixels marked according to reliability. (a) Frame 50, Watson sequence. (b) Frame 65, Watson sequence.

To evaluate whether updating the target model is appropriate, we introduce the two-class variance ratio. Let $T_{d\_f}$ be the set of reliabilities of the pixels in the foreground field and $T_{d\_b}$ be the set of reliabilities of the pixels in the background field. Then

$$\text{Variance Ratio} = \frac{\operatorname{var}(T_{d\_f} + T_{d\_b})}{\operatorname{var}(T_{d\_f}) + \operatorname{var}(T_{d\_b})}. \tag{Eq. 3-44}$$

The variance ratio of the reliability map shown in Figure 3.19(a) is 0.5866, and the variance ratio of the reliability map shown in Figure 3.19(b) is 0.9280. We set the threshold at 0.65. Thus, the target model is updated whenever the variance ratio is larger than 0.65.
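The update decision of Eq. 3-44 can be sketched in Python as below. This is a hedged illustration, not the thesis code. Two points are assumptions: the numerator is read as the variance of the pooled set of foreground and background reliabilities, and the sample variance (normalized by N − 1) is used. The names `variance_ratio` and `should_update_model` are hypothetical.

```python
import statistics


def variance_ratio(t_f, t_b):
    """Two-class variance ratio of Eq. 3-44.

    t_f: reliabilities of pixels in the foreground field.
    t_b: reliabilities of pixels in the background field.
    ASSUMPTION: the numerator is the variance of the pooled samples.
    """
    pooled = list(t_f) + list(t_b)
    within = statistics.variance(t_f) + statistics.variance(t_b)
    return statistics.variance(pooled) / within


def should_update_model(ratio, threshold=0.65):
    """Update the target model only when the ratio exceeds the threshold."""
    return ratio > threshold
```

With the two values reported for Figure 3.19, `should_update_model(0.5866)` is false and `should_update_model(0.9280)` is true. The ratio grows when the two fields have clearly different reliabilities, which is the situation in which the ellipse encloses the object and little background.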

3.2.8 Overall Object Tracking Process

To sum up, the flow chart of the proposed object tracking procedure is shown in Figure 3.20. Compared with the flow chart of the traditional mean-shift tracking process in Figure 3.3, we do not need to build the candidate during the mean-shift iterations. The location convergence condition is

$$\| y_1 - y_0 \|^2 < 1,$$

and the orientation convergence condition is

$$\left| \frac{\det(V_{new})}{\det(V)} - 1 \right| < 0.05,$$

which means that the number of enclosed samples changes by less than 5%.

Figure 3.20 Flow chart of the proposed object tracking process.

Furthermore, we can make some simple predictions between frames or between mean-shift iterations to shorten the processing time. On the other hand, the judgment of

target loss is an additional stage that increases robustness. We use two simple rules to judge whether tracking has failed. The first rule checks the covariance matrix $V$ obtained by Eq. 3-40. If $\det(V) \le 0$, we are not able to define the bounding ellipse. Recalling Eq. 3-17, a negative area is not reasonable. In practice, a hyperbola is obtained instead of the bounding ellipse. The second rule checks the reliability. If all blobs have low reliability, the tracking result is likely to be a failure.
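The two convergence conditions of Section 3.2.8 and the two target-loss rules can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis implementation. The function names are hypothetical, positions are 2-D tuples, covariance matrices are 2 × 2 nested lists, and the cut-off `low_reliability` is an assumed parameter because the text does not give a numerical threshold for "low reliability".

```python
def det2(v):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    return v[0][0] * v[1][1] - v[0][1] * v[1][0]


def location_converged(y1, y0, tol=1.0):
    """Location condition: squared distance between iterates below tol."""
    dx, dy = y1[0] - y0[0], y1[1] - y0[1]
    return dx * dx + dy * dy < tol


def shape_converged(v_new, v, tol=0.05):
    """Orientation condition: determinant ratio within 5% of one."""
    return abs(det2(v_new) / det2(v) - 1.0) < tol


def target_lost(v, reliabilities, low_reliability=1.0):
    """Target-loss rules.

    Rule 1: a non-positive determinant means no bounding ellipse exists.
    Rule 2: every blob has low reliability.
    ASSUMPTION: low_reliability = 1.0 is a placeholder threshold.
    """
    if det2(v) <= 0.0:
        return True
    return all(t < low_reliability for t in reliabilities)
```

For example, a covariance matrix `[[1, 2], [2, 1]]` has determinant −3, so `target_lost` reports a failure regardless of the reliabilities, which corresponds to the hyperbola case mentioned in the text.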

Chapter 4. Experimental Results

In this chapter, we present some object tracking results obtained with the proposed algorithm. The resolution of all sequences is 640 × 480. The algorithm is implemented on the MATLAB 6.5 platform and runs on a 3 GHz Pentium 4 PC with 512 MB of DRAM. The red ellipse and the blue bounding box represent the results of the proposed algorithm and the traditional mean-shift process, respectively.

In the first experiment, our mean-shift algorithm was run on the sequence “Hans”. There is neither scene change nor occlusion in this sequence. The tracking result is shown below, together with the reliability map, which shows how the reliabilities change during tracking. The multi-blob model is built at Frame 35, and all reliabilities are initialized to 1. In addition, we ran the traditional mean-shift procedure at the same time, which employs the “plus or minus 10 percent” scale adaptation method and uses a 16 × 16 × 16 histogram in the RGB space as the model.

Tracking result panels: Frames 35, 40, 50, 75, 120, 160, 195, 235, 295, 360, 420, 485, 515, 590, 630, and 695.