Object-Based Video Tracking and Abstraction on Surveillance Videos


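National Chiao Tung University, Department of Computer Science and Information Engineering, Master's Thesis. Object-Based Video Tracking and Abstraction on Surveillance Videos. Student: Hui-Ping Kuo. Advisor: Prof. Suh-Yin Lee. June 2004 (year 93 of the Republic of China).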

Object-Based Video Tracking and Abstraction on Surveillance Videos

Student: Hui-Ping Kuo
Advisor: Prof. Suh-Yin Lee

Master's Thesis, Department of Computer Science and Information Engineering, National Chiao Tung University

A Thesis Submitted to the Institute of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Computer Science and Information Engineering

June 2004
Hsinchu, Taiwan, Republic of China

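Object-Based Video Tracking and Abstraction on Surveillance Videos

Student: Hui-Ping Kuo. Advisor: Suh-Yin Lee. Department of Computer Science and Information Engineering, National Chiao Tung University.

Abstract (translated from the Chinese)

Video abstraction concisely represents the important content of video data by extracting the important semantics it contains. In surveillance video, object-based video abstraction can express the events that happen to objects and the significance they carry. Therefore, in video surveillance systems such as intelligent transportation systems, object-based video abstraction can be used to raise alarms for important object events. In addition, video abstraction helps us retrieve and manage important content in a database of surveillance video.

In this thesis, we propose an object-based tracking and abstraction system for surveillance video. First, we use a background-referenced moving-object segmentation algorithm to separate the moving objects from the background. Next, we use a simple but effective moving-object tracking algorithm to obtain the trajectories and features of the moving objects. Finally, we design an algorithm that generates the video abstraction by selecting the moving objects that represent important video content, and we use this algorithm to produce object-based video abstractions. We demonstrate the effectiveness of the proposed system by showing the selected objects that carry representative meanings and events.

Based on the proposed architecture, we implemented a real-time video surveillance system capable of on-line alarming. In this system, the generated abstraction is used to raise alarms for important events in real time while the objects in the surveillance video are tracked in real time. We tested several different types of surveillance video clips, and the experimental results show that our system achieves satisfactory results in both object tracking and the abstraction of important objects.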

Object-Based Video Tracking and Abstraction on Surveillance Videos

Student: Hui-Ping Kuo. Advisor: Prof. Suh-Yin Lee. Institute of Computer Science and Information Engineering, National Chiao Tung University.

Abstract

Video abstractions represent video contents concisely by extracting the important semantics in the video. Important events and semantic meanings of moving objects in surveillance videos can be represented using object-based video abstractions. Therefore, object-based video abstractions can be used to raise alarms for important events in surveillance and monitoring systems such as the intelligent transportation system (ITS). In addition, video abstractions help retrieve and manage important contents in a surveillance video database.

In this thesis, we propose an object-based video tracking and abstraction system. First, we use a background-registration segmentation algorithm to segment the moving objects. Then, a simple but effective tracking algorithm is introduced to extract the object trajectories and object features. Finally, an abstraction algorithm generates the object-based abstraction by selecting key objects that contain important semantic contents. We demonstrate the performance of the tracking and abstraction algorithms by showing the extracted objects with representative features and events, such as object appearance and disappearance, object occlusion and split, and changes in motion.

We present a real-time tracking system with on-line alarming to demonstrate the implemented system. In this system, alarms for important object events are raised on-line using the generated object-based abstraction while the surveillance video is being tracked in real time. We test our system with several surveillance video sequences, and the experimental results show that we obtain satisfactory results in both object-based tracking and abstraction.

Acknowledgement

I sincerely appreciate the guidance and encouragement of my advisor, Prof. Suh-Yin Lee. She encouraged me to explore research topics freely and helped me enthusiastically. Without her, I could not have completed this thesis. I would also like to extend my thanks to my lab mates in the Information System Laboratory, especially Mr. Duan-Yu Chen and Mr. Ming-Ho Hsiao, who gave me many suggestions and shared their experience. Finally, I want to express my appreciation to my parents for their support. They gave me the opportunity to have a good education. This thesis is dedicated to them.

Table of Contents

Abstract in Chinese
Abstract in English
Acknowledgement
Table of Contents
List of Figures
List of Equations
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Organization
Chapter 2 Background
  2.1 Video Object Segmentation
  2.2 Video Object Tracking
  2.3 Video Abstraction
Chapter 3 Object-Based Video Tracking and Abstraction on Surveillance Videos
  3.1 System Overview
  3.2 Video Object Segmentation Algorithm
    3.2.1 Inter-frame Differencing
    3.2.2 Dynamic Threshold Decision
    3.2.3 Background Buffer Update
    3.2.4 Background Buffer Differencing
    3.2.5 Morphological Operation
    3.2.6 Connected Component and Size Filtering
  3.3 Video Object Tracking Algorithm
    3.3.1 Matching Function
    3.3.2 Object States
    3.3.3 Objects Matching Algorithm
      3.3.3.1 Occluded Objects Matching
      3.3.3.2 Split Objects Matching
      3.3.3.3 Estimated Objects Matching
    3.3.4 Temporal Filtering
  3.4 Video Abstraction Algorithm
    3.4.1 Object State Analysis
    3.4.2 Object Trajectory Analysis
    3.4.3 Video Abstraction with Selected Key Objects
Chapter 4 System Architecture and Experiment Result
  4.1 System Architecture Overview
  4.2 Experimental Results of the Video Object Segmentation
  4.3 Experimental Results of the Video Object Tracking
  4.4 Experimental Results of the Video Abstraction
  4.5 Integrated System for Real-Time Video Object Tracking and On-Line Alarming
Chapter 5 Conclusion and Future Work
Reference

List of Figures

Fig. 1 Segmentation process diagram
Fig. 2 The histogram of a difference image
Fig. 3 The cumulative histogram and the interpolated lines
Fig. 4 The process diagram of the tracking algorithm
Fig. 5 Illustration of the Motion Vector Distance
Fig. 6 Pseudo code of the matching function
Fig. 7 The state transition of the OBJ_STATE
Fig. 8 The state transition of the OCC_STATE
Fig. 9 The process of the whole matching algorithm
Fig. 10 The relationship of occlusion objects
Fig. 11 The occluded objects matching process
Fig. 12 The relationship of split objects
Fig. 13 The split object matching process
Fig. 14 The condition that estimated object is appended
Fig. 15 The estimated objects matching process
Fig. 16 The path of the mass center of a walking person
Fig. 17 The abstraction algorithm
Fig. 18 The trajectory analysis and key object selection process
Fig. 19 The key object exporting process
Fig. 20 System architecture overview
Fig. 21 Result comparison when combining chrominance channel
Fig. 22 Result comparison of the morphological operation
Fig. 23 Segmentation results of the clip speedway
Fig. 24 Segmentation results of the clip hall monitor
Fig. 25 Segmentation results of the clip ETRI_A
Fig. 26 Segmentation results of the clip ETRI_B
Fig. 27 Segmentation results of the clip ETRI_C
Fig. 28 Tracking results of the speedway sequence
Fig. 29 Tracking results of the ETRI_C sequence
Fig. 30 Tracking results of the ETRI_B sequence
Fig. 31 Selected key objects for the detected occlusion event
Fig. 32 Selected key objects for the detected split event
Fig. 33 Selected key objects for the ETRI_B sequence
Fig. 34 Selected key objects for the ETRI_B sequence
Fig. 35 Parts of the abstraction of the ETRI_C sequence
Fig. 36 Parts of the abstraction of the hall monitor sequence
Fig. 37 Parts of the abstraction of the speedway sequence
Fig. 38 Interface of the integrated system

List of Equations

Equation 1 Compute the Difference Image
Equation 2 Apply the Threshold on Difference Image
Equation 3 Update Background Buffer
Equation 4 Update Background Buffer
Equation 5 Compute the Weight of the Chrominance Channel
Equation 6 Compute the Weight of the Chrominance Channel
Equation 7 Compute Difference with a Weighted Sum
Equation 8 Compute the Background Difference Image
Equation 9 Apply the Threshold on Background Difference Image
Equation 10 Compute the Cosine of the Included Angle
Equation 11 Compute the Motion Vector Distance (MVD) Value
Equation 12 Definition of the Precision
Equation 13 Definition of the Recall

List of Tables

Table 1 Statistics of Segmentation Result
Table 2 Statistics of the Tracking and Detecting of Occlusion Events
Table 3 Statistics of the Tracked Trajectories
Table 4 Statistics of the Abstraction

Chapter 1 Introduction

1.1 Motivation

Video technology has advanced rapidly in recent years. With the maturity of digital video processing techniques and compression standards, digital video applications such as video surveillance systems and streaming media have been widely adopted. However, with traditional video encoding standards such as MPEG-1 and MPEG-2, it is difficult to manage and retrieve the important content, or the content of interest, efficiently when the video repository is very large. It is therefore essential to manage the video content in order to enhance the value and the usability of the produced video data.

Fortunately, the MPEG-4 encoding standard introduced a new feature, the Video Object Plane (VOP). The concept of the VOP is that a video sequence can be encoded into separate bitstreams according to its contents, such as moving objects and background. Thus, we can segment video frames into semantically meaningful video objects and encode these objects separately. This feature allows us to manage and retrieve object-level information easily and thus provides higher-level semantics and interactivity. In surveillance videos especially, where the moving objects are the most important contents, the benefits of extracting object-level information can be fully exploited.

Video abstraction is commonly used to represent the important contents of a video concisely, and it helps us retrieve those contents. Traditional methods usually generate the abstraction by extracting key frames with representative features. Since the moving objects are what matters in surveillance videos, we can use an object-based abstraction in which the key frames are obtained by extracting significant key objects. Because the abstraction is object-based, object-level semantics and events can be represented.

Object-based video abstraction can be further applied to surveillance systems such as the Intelligent Transportation System (ITS), where the abstraction can be used to raise alarms for object events and important contents. In this thesis, we present an object-based video tracking and abstraction system for surveillance video. General object events such as appearances, occlusions, and changes in motion are detected. With domain knowledge, these general object events can be extended and deployed in many surveillance applications. For example, an occlusion of objects can indicate a car accident, and changes in object motion can be used to detect illegal driving. A real-time tracking system with on-line alarming for general events is implemented in this thesis.

1.2 Organization

The rest of this thesis is organized as follows. Chapter 2 introduces the background and the related work on video object segmentation, tracking, and abstraction. Chapter 3 presents our proposed algorithms for video tracking and abstraction. Chapter 4 shows the architecture of the system and the experimental results. Chapter 5 concludes the thesis.

Chapter 2 Background

In some applications of the Intelligent Transportation System (ITS), such as traffic surveillance and monitoring, the goal is to monitor the moving objects in the surveilled environment. Video object segmentation is therefore required first, to extract the moving objects. After that, video object tracking and abstraction are required to track the object trajectories and to extract the significant key objects that may be of interest. In this chapter, we introduce the related work on video object segmentation, video object tracking, and video abstraction in Sections 2.1, 2.2, and 2.3, respectively.

2.1 Video Object Segmentation

Object segmentation is the first step toward object-level abstraction. Its task is to find a mask that indicates the shape and the position of each moving object. Object segmentation has been studied extensively in the literature. Generally, segmentation algorithms can be classified into two categories: homogeneity-based methods and change-detection-based methods.

The homogeneity-based algorithms [1-4] segment moving objects based on the homogeneity of their color, texture, or motion information. Pixels with similar features are first grouped into small regions, and these regions are then grouped into objects using other features. Algorithms of this kind can provide precise object masks. However, the watershed algorithm used for the boundary decision is computationally expensive, and the motion estimation needed to compute precise motion vectors for clustering small regions also takes a lot of time. Thus, this kind of algorithm is not a good choice for systems that have real-time requirements.

The other category is the change-detection-based algorithms [5-7]. These algorithms segment objects by taking the difference between the current input frame and a reference image, and then applying a threshold to obtain a difference mask that indicates the shape and the position of the moving objects. Traditionally, the previous input frame is chosen as the reference image, which is simple and efficient. However, this choice has some well-known drawbacks [8]. First, when the speed of a moving object is not consistent, the difference image no longer indicates the object position reliably, so misses and false alarms in the segmentation are unavoidable. Second, uncovered background is a problem in traditional change-detection algorithms, because background regions that were covered by objects in the previous frame may be considered as changed. Although uncovered background can be detected and removed when motion information is taken into consideration, motion estimation is expensive and greatly lowers the efficiency of the change-detection algorithms.

More recently, some change-detection algorithms [8-11], [15], [27] use a reference background image to segment moving objects. The reference background image contains the still background without any objects; it is either acquired beforehand or updated dynamically. Change-detection algorithms with a registered background effectively solve the problems of uncovered background and inconsistent object speed. In addition, they are efficient and can meet the real-time requirement. In our proposed system, we adopt the change-detection-based algorithm with a registered background to segment moving objects.

2.2 Video Object Tracking

Video object tracking is an important and frequently discussed research topic.
Its objective is to match the objects detected in the current frame to the corresponding objects detected previously. The tracked positions and shapes of the objects can be used to generate the VOPs in MPEG-4, or to form the object trajectories for later object-based analysis and abstraction. Thus, video object tracking plays an important role in systems that extract MPEG-4 VOPs or the object-based descriptions and abstractions of MPEG-7.

Object tracking algorithms take the object masks detected by the segmentation algorithm as input and try to match them to the objects detected earlier using features such as position, shape, and color. For example, Oberti et al. [12] used the shape of the object corners to track video objects. Some tracking algorithms also take motion information into consideration. For example, Kim et al. use the direction of the motion and the variation of the speed to compute a smoothness feature as the matching criterion [13], and Chen used the motion as a constraint to find matching objects [14].

Some other algorithms [15-17] adopt Kalman filtering, a linear estimation process that recursively estimates the current value and updates the prediction, to estimate and track the positions of the objects. The precision of the prediction involves two errors: the process error and the measurement error. Because the segmentation process sometimes contains errors due to a cluttered scene, the object masks are not very precise and hence the measurement error is large. In addition, abrupt movements of the objects, such as the waving of hands, make the process error large. The prediction error may not converge quickly if both the process error and the measurement error are large, so it may be difficult to match objects correctly because of the uncertainty of the prediction.

In this thesis, instead of using Kalman filtering, our algorithm uses the motion information as the matching feature, which is efficient and effective. Our tracking algorithm also handles the occlusion and the split of objects. The trajectories of the objects, as well as the occlusion and split events, are stored for the object abstraction in the later stage.

2.3 Video Abstraction

Video abstraction represents a large amount of video data in a compact form. It is generated by selecting the key frames that are representative of the features and the contents of the video. Traditionally, key frame extraction (KFE) algorithms [18-21] are frame-based and use image features. For example, the video sequence is first decomposed into scenes and the scenes are then decomposed into shots. Key frames are then extracted from each shot to represent the features of that shot. Image features such as color and motion intensity are usually used as the criteria to select key frames.

In surveillance video, because the background is stationary and the moving objects are the most important contents, we can choose the moving objects with important features or significant events to represent the video contents. However, the traditional KFE algorithms are not applicable because they use only low-level image features, and the generated abstractions lack object-level semantics. In [22] and [23], Kim et al. proposed a system that selects key objects for video abstraction using the shape information of the moving objects. They detected changes in the shapes by computing moment invariants [12] to capture the actions of the moving objects. However, it is difficult to detect changes in the object shape for rigid moving objects such as vehicles. In our proposed abstraction algorithm, we select key objects based on the characteristics of the object trajectories and the detected object events.

Other object-based KFE algorithms were proposed in [24] and [25]. In [24], the ratio of the number of intra-coded macroblocks (I-MBs) to the total number of encoded macroblocks in the VOP is used as the criterion for key object selection, while [25] detects significant changes in the shape of the VOP in the MPEG-4 compressed domain. Unlike these previous works, which use pre-segmented VOPs, we propose a system that tracks video objects in real time and selects key objects to generate video abstractions on-line.

Chapter 3 Algorithm for Object-Based Video Tracking and Abstraction on Surveillance Videos

In this chapter, we present the proposed system for video object tracking and abstraction. The whole system is composed of three main modules: video object segmentation, video object tracking, and video abstraction. Section 3.1 gives an overview of the whole system. Section 3.2 presents our video object segmentation algorithm. Section 3.3 presents the video object tracking algorithm that tracks the detected moving objects. Section 3.4 presents the video abstraction algorithm.

3.1 System Overview

Our proposed system contains three modules: the video object segmentation module, the object tracking module, and the video abstraction module. The surveillance video data are first captured and input to the video object segmentation module. The object segmentation algorithm segments the moving objects from the background and generates the object masks, which indicate the positions and the shapes of the moving objects. The segmented object masks are then input to the video object tracking module. The object tracking algorithm matches the input object masks to the objects that were input previously. The occlusion and the splitting of objects are also detected and reasoned about in the object tracking stage. In the video abstraction module, the video abstraction is generated by selecting the frames that contain semantically meaningful objects.

3.2 Video Object Segmentation

In our object-based tracking and abstraction system, the first step is to segment the moving objects as precisely as possible. The object segmentation algorithm takes the raw video data directly as input, segments the moving objects in the surveillance video sequence, and extracts the object masks that indicate the presence of the moving objects.

In surveillance video, the position of the camera is fixed and the background is stationary, so the simplest way to segment moving objects is the change-detection-based method. When a frame is compared with a background image, it is straightforward to consider the regions that change significantly as moving objects. Thus, selecting a background image as the reference for the change-detection-based algorithm achieves our goal effectively.

However, besides the moving objects that we are interested in, there are other types of changing regions that may be misclassified in the segmentation process. One type is camera noise, which is the white noise of the camera and is usually small. The other type is often called a 'ghost'. A ghost is a changing region that appears and then disappears quickly without steady motion, and it is usually bigger than camera noise. The ghost effect usually results from the waving of tree leaves and from regional lighting effects. To obtain accurate object masks, these unwanted changing regions should be filtered out. In our segmentation algorithm, we adopt the change-detection-based algorithm with background registration and also apply a filtering process for noise and ghosts. The whole process of our segmentation algorithm is shown in Fig. 1.

[Fig. 1. Segmentation process diagram. Blocks in the flowchart: Current Frame and Previous Frame; Take Difference; Difference Image; Thresholding by THd; Difference Mask; Update Background; BG Buffer; Take Difference; BG Difference Image; Thresholding by THb; BG Difference Mask; Morphological Operation; Object Mask; Connected Component; Size Filtering; Current Frame Objects. BG: Background.]

3.2.1 Inter-frame Differencing

In the segmentation algorithm, the first step is to compute the inter-frame difference image (DI) between the current input frame and the previous frame. The difference value in the DI shows how strongly a pixel changes between two consecutive frames, and hence how likely the pixel is to be considered as changing. Because human eyes are more sensitive to luminance than to chrominance, we take the difference on the luminance channel only. After applying the threshold $TH_d$ to the difference image, we obtain a difference mask (DM) that indicates the changed regions between two consecutive frames. The computations of DI and DM are shown in Eq. (1) and Eq. (2), where $C^Y(i,j)$ and $P^Y(i,j)$ denote the luminance values of pixel $(i,j)$ in the current frame and in the previous frame, respectively. A value of 255 in the DM marks a changed pixel, which is the convention that the background update in Eq. (4) relies on. The DM is used to update the background image in the next step.

\[ DI(i,j) = \left| C^Y(i,j) - P^Y(i,j) \right| \tag{1} \]

\[ DM(i,j) = \begin{cases} 255 & \text{if } DI(i,j) > TH_d \\ 0 & \text{if } DI(i,j) \le TH_d \end{cases} \tag{2} \]

3.2.2 Dynamic Threshold Decision

To make our segmentation algorithm adapt to various kinds of environments and video contents, the threshold $TH_d$ for deciding the DM cannot be fixed and should be selected adaptively. In many studies [8], [10], [26-27], the values in the difference image are modeled by a mixture of two distributions, so finding the threshold corresponds to finding the two distribution functions that approximate the histogram of the difference values. Traditionally, the valley between the two peaks is found and chosen as the threshold that divides the two distributions. However, in real cases such as the one shown in Fig. 2, the histogram fluctuates heavily and it is difficult to find a threshold just by finding a valley.

In [26], Wu et al. suggested that the histogram can be converted to a monotonically increasing histogram by accumulating the original histogram values, as shown in Fig. 3. In the cumulative histogram, the problem of finding a threshold is simplified to finding an intermediate point such that the two straight lines interpolated through the start point, the intermediate point, and the end point best approximate the cumulative histogram. Instead of the ratio histogram used in [26], we simply use the difference, since the ratio histogram is much more expensive to compute.

[Fig. 2. The histogram of a difference image. Horizontal axis: absolute difference value; vertical axis: pixel number.]

[Fig. 3. The cumulative histogram and the interpolated lines. Horizontal axis: absolute difference value; vertical axis: accumulative pixel number; the point labeled T is marked on the plot.]
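The steps of Sections 3.2.1 and 3.2.2 can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function names are hypothetical, frames are assumed to be nested lists of integer luminance values in 0 to 255, and the fitting error is assumed to be the sum of absolute differences between the two interpolated lines and the cumulative histogram, which is one plausible reading of "use the difference".

```python
def difference_image(cur, prev):
    """Eq. (1): per-pixel absolute luminance difference of two equal-size frames."""
    return [[abs(c - p) for c, p in zip(crow, prow)]
            for crow, prow in zip(cur, prev)]


def cumulative_histogram(di, levels=256):
    """Accumulate the histogram of difference values into a non-decreasing curve."""
    hist = [0] * levels
    for row in di:
        for v in row:
            hist[v] += 1
    cum, total = [], 0
    for h in hist:
        total += h
        cum.append(total)
    return cum


def _line_error(cum, x0, x1):
    """Sum of absolute differences between cum and the straight line
    joining (x0, cum[x0]) and (x1, cum[x1]) (assumed error measure)."""
    if x1 == x0:
        return 0.0
    slope = (cum[x1] - cum[x0]) / (x1 - x0)
    return sum(abs(cum[x0] + slope * (x - x0) - cum[x])
               for x in range(x0, x1 + 1))


def select_threshold(cum):
    """Pick the intermediate point whose two line segments fit cum best."""
    last = len(cum) - 1
    best_t, best_err = 1, float("inf")
    for t in range(1, last):
        err = _line_error(cum, 0, t) + _line_error(cum, t, last)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def difference_mask(di, th):
    """Eq. (2): 255 marks a changed pixel, 0 an unchanged one."""
    return [[255 if v > th else 0 for v in row] for row in di]
```

A typical call chain would be `di = difference_image(cur, prev)`, then `th = select_threshold(cumulative_histogram(di))`, then `difference_mask(di, th)`. The exhaustive search over all intermediate points is quadratic in the number of gray levels, which is small (256) and independent of the frame size.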

3.2.3 Background Buffer Update

The next step in the segmentation algorithm is to update the background image (BI). In the background registration method [8-11], [15], [27], the quality of the segmentation result depends heavily on the correctness of the background, so we need a robust method to retrieve and maintain the background image. The simplest way to obtain the background image is to capture the background beforehand. However, the background image may change slightly and gradually because the luminance varies with time. In our algorithm, we dynamically update the background buffer using the difference mask. With the difference mask, the regions that are marked as 'changed' are not updated, to avoid distortion. Every time a new difference mask is computed, the background image buffer at the current time $t$, $BI_t$, is updated at position $(i,j)$ using Eq. (3) and Eq. (4). In these equations, $k(i,j)$ is the bias factor of the pixel $(i,j)$, which accelerates the background update, and $\alpha$ is the weighting factor used in the update function of Eq. (4).

\[ k(i,j) = \begin{cases} 1 & \text{if } C^Y(i,j) > BI^Y_{t-1}(i,j) \\ -1 & \text{if } C^Y(i,j) \le BI^Y_{t-1}(i,j) \end{cases} \tag{3} \]

\[ BI^Y_t(i,j) = \begin{cases} BI^Y_{t-1}(i,j) & \text{if } DM(i,j) = 255 \\ \alpha \cdot C^Y(i,j) + (1-\alpha) \cdot BI^Y_{t-1}(i,j) + k(i,j) & \text{if } DM(i,j) = 0 \end{cases} \tag{4} \]

When the system starts up, the background buffer is empty and a period of time is required to initialize it. With our background update equation, the initialization can be completed in a few frames. After the initialization, we update the background buffer every 30 frames, since the background color does not change frequently. Gradual variations are quickly absorbed into the background buffer. Even when there is a sudden lighting variation, for example when the clouds disperse and the sun is revealed, the update equation catches up with the variation in a short period of time.
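The update rule of Eq. (3) and Eq. (4) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `update_background` is hypothetical, the images are nested lists of luminance values, and the default `alpha=0.1` is an assumed value, since no value for the weighting factor is given here.

```python
def update_background(bg, cur, dm, alpha=0.1):
    """Return the updated luminance background buffer.

    bg  : previous background BI_{t-1} (nested lists of numbers)
    cur : current frame luminance C^Y
    dm  : difference mask, 255 = changed, 0 = unchanged
    alpha : weighting factor of Eq. (4); 0.1 is an assumed default
    """
    new_bg = []
    for bg_row, cur_row, dm_row in zip(bg, cur, dm):
        row = []
        for b, c, m in zip(bg_row, cur_row, dm_row):
            if m == 255:
                # Changed pixel: keep the old background to avoid distortion.
                row.append(b)
            else:
                k = 1 if c > b else -1  # Eq. (3): bias toward the current value
                row.append(alpha * c + (1 - alpha) * b + k)  # Eq. (4)
        new_bg.append(row)
    return new_bg
```

For example, with `alpha=0.5`, a background value of 100 and an unchanged current value of 120 give 0.5 * 120 + 0.5 * 100 + 1 = 111. A practical implementation would probably also clamp the result to the valid luminance range, which this sketch leaves out.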
3.2.4 Background Differencing

After we obtain a background buffer, we can segment the moving objects from the background. Unlike the computation of the difference image, here we use the luminance and chrominance channels together instead of luminance only. Some moving objects differ clearly from the background in color, yet the difference between the current frame and the background is almost zero when the chrominance information is discarded. To extract accurate object masks and not to miss any important moving objects, the chrominance information must be considered. Because the importance of the chrominance channels depends on the intensity of the luminance channel, we design a difference score function to evaluate the difference in the YUV color space.

Denote the difference score as DS. Eq. (5) through Eq. (8) show how to compute it. For a pixel $(i,j)$, we first take the stronger luminance intensity of the current frame and the background image. Then, we decide the weighting factor of the chrominance based on this luminance intensity. Because the valid range of the luminance channel after conversion is from 16 to 235, we subtract 16 from the luminance value and quantize the result into 11 levels, from 0.0 to 1.0. After that, we use the weighted sum in Eq. (7) to compute the color distance in the YUV color space. Thresholding the resulting background difference image (BDI) with $TH_b$ gives the background difference mask (BDM) in Eq. (9), where 255 marks a pixel that differs from the background.

\[ M(i,j) = \max\left( C^Y(i,j),\; BI^Y_t(i,j) \right) \tag{5} \]

\[ w = \left\lfloor \frac{M(i,j) - 16}{20} \right\rfloor \Big/ 10 \tag{6} \]

\[ DS\big(C(i,j), BI(i,j)\big) = \frac{\left| C^Y(i,j) - BI^Y(i,j) \right| + w \cdot \left| C^U(i,j) - BI^U(i,j) \right| + w \cdot \left| C^V(i,j) - BI^V(i,j) \right|}{1 + 2 \cdot w} \tag{7} \]

\[ BDI(i,j) = DS\big(C(i,j), BI(i,j)\big) \tag{8} \]

\[ BDM(i,j) = \begin{cases} 255 & \text{if } BDI(i,j) > TH_b \\ 0 & \text{if } BDI(i,j) \le TH_b \end{cases} \tag{9} \]

Sometimes an object enters the frame and then stays at the same position on the xy plane. We call such an object a 'stopped object', since its motions in both the x and y directions are almost zero. Because the algorithm updates the background with the input frame in the unchanged regions, the color of a stopped object is absorbed into the background buffer when the object stops for too long. In this case, false alarms occur in the object regions because the background has been wrongly updated. Although this problem can be alleviated by lengthening the interval between background updates, doing so also lengthens the time needed to adapt to luminance variations. This is a tradeoff, and both cases must be taken into consideration.

After we finish computing the background difference image using the difference score function, another dynamically selected threshold, $TH_b$, is applied to obtain the background difference mask, as shown in Eq. (9). The background difference mask indicates the moving object regions relative to the reference background image. However, it contains a lot of noise and its object boundaries are not smooth, so further filtering is required.

3.2.5 Morphological Operation

To smooth the object boundaries and remove the noise, two kinds of morphological operations are frequently used [8], [13], [23]. The closing operation is first used to fill the black holes inside the object masks, and the opening operation is then used to remove the small noise regions that do not belong to the moving objects. In our algorithm, structuring elements of size 7x7 and 5x5 are selected for the closing and opening operations, respectively. In most cases, the smaller camera noise is filtered out successfully. However, larger regions caused by the ghost effect are hard to remove. Although a larger structuring element may help, the computation cost also becomes higher. Thus, instead of using a larger structuring element, we filter out these ghost regions in the video object tracking algorithm with temporal and spatial filtering.
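The difference score of Section 3.2.4 and the closing-then-opening filter of Section 3.2.5 can be sketched together as below. This is a hedged sketch, not the thesis code: all function names are hypothetical, pixels are assumed to be (Y, U, V) tuples, absolute differences are assumed in the score, the clamp on the weight is an added safeguard for luminance values outside 16 to 235, and pixels outside the image are treated as background (0) in the morphological operators.

```python
import math


def difference_score(cur, bg):
    """Weighted YUV distance for one pixel; cur and bg are (Y, U, V) tuples."""
    m = max(cur[0], bg[0])                  # stronger luminance of the two
    w = math.floor((m - 16) / 20) / 10      # chrominance weight in 11 levels
    w = min(max(w, 0.0), 1.0)               # added safeguard, not in the text
    dy, du, dv = (abs(c - b) for c, b in zip(cur, bg))
    return (dy + w * du + w * dv) / (1 + 2 * w)


def background_difference_mask(cur, bg, th_b):
    """255 where the score exceeds th_b (differs from background), else 0."""
    return [[255 if difference_score(c, b) > th_b else 0
             for c, b in zip(crow, brow)]
            for crow, brow in zip(cur, bg)]


def _morph(mask, k, pick):
    """Apply a k x k square structuring element.
    pick=max dilates, pick=min erodes; outside pixels count as 0."""
    h, w, r = len(mask), len(mask[0]), k // 2

    def at(y, x):
        return mask[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [[pick(at(y, x)
                  for y in range(i - r, i + r + 1)
                  for x in range(j - r, j + r + 1))
             for j in range(w)]
            for i in range(h)]


def closing(mask, k):
    """Dilation followed by erosion: fills small holes inside objects."""
    return _morph(_morph(mask, k, max), k, min)


def opening(mask, k):
    """Erosion followed by dilation: removes small isolated regions."""
    return _morph(_morph(mask, k, min), k, max)


def clean_mask(mask, k_close=7, k_open=5):
    """Closing with a 7x7 element, then opening with a 5x5 element."""
    return opening(closing(mask, k_close), k_open)
```

The brute-force window scan costs O(k^2) per pixel, which makes the cost argument in the text concrete: enlarging the structuring element to suppress ghost regions raises the per-frame cost quadratically in k.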
After the morphological operations, the object mask is smoothed and indicates the shapes and the positions of all the moving objects in the current frame. The individual objects in the object mask are then extracted in the next process.

3.2.6 Connected Component Labeling Algorithm and Size Filtering

The tracking algorithm takes the extracted object mask as its input. However, the object mask simply indicates the positions and the shapes of all the moving regions without separating them. Thus, each individual object in the object mask must be extracted and assigned an identifying label. The connected component algorithm is frequently used for this task [8], [29]. For every pixel, it first examines the neighboring pixels and assigns the pixel a label. After that, pixels with the same label or with equivalent labels are clustered together to form an isolated object. Because some large noise and ghost regions are hard to remove completely, size filtering must be performed after the labeling process. The size filtering process filters out the regions that are smaller than a predefined threshold. The objects that are not filtered out are called the objects of interest, and they are tracked by the tracking algorithm.
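Labeling and size filtering can be sketched as follows. Note the substitution: instead of the two-pass labeling with equivalent labels described above, this sketch uses a breadth-first flood fill, which is simpler to write and yields the same components. The function names, the 8-connectivity, and the `min_size` parameter are assumptions made for illustration.

```python
from collections import deque


def label_components(mask):
    """Label 8-connected non-zero regions of a mask.

    Returns (labels, sizes): labels has 0 for background and 1, 2, ... for
    objects; sizes maps each label to its pixel count.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 1
    for i in range(h):
        for j in range(w):
            if mask[i][j] == 0 or labels[i][j]:
                continue
            # Start a new component and flood-fill it.
            labels[i][j] = next_label
            queue = deque([(i, j)])
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            sizes[next_label] = count
            next_label += 1
    return labels, sizes


def size_filter(labels, sizes, min_size):
    """Zero out every component smaller than min_size pixels."""
    keep = {lab for lab, n in sizes.items() if n >= min_size}
    return [[lab if lab in keep else 0 for lab in row] for row in labels]
```

The labels that survive `size_filter` correspond to the objects of interest that are handed to the tracking stage.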

3.3 Video Object Tracking

The second module in our system is the video object tracking module. Although object-level information can be extracted through the segmentation of video objects, higher-level object semantics can only be extracted from the object trajectories. Thus, the object tracking process plays the key role in semantic analysis and video abstraction. Our tracking algorithm takes the extracted object masks as input and tracks all the objects to obtain the object trajectories. The tracking algorithm is divided into several sub-modules, which are presented in the following sections. The diagram is shown in Fig. 4.

[Fig. 4 diagram. Blocks: Previous Segmented Objects; Current Segmented Objects; Object Matching Algorithm; Occlusion Objects; Object Trajectories; Temporal Filtering; Smoothed Object Trajectories.]

Fig. 4. The process diagram of the tracking algorithm

In the tracking algorithm, the object information, such as the mass center and the motion, is first gathered for each detected object. We use a simple but effective matching function based on the motion, which is presented in detail later.

Based on our observation, occlusion can happen inside or outside the camera view. In the first case, we can observe the occlusion of the objects. In the second case, the objects are already occluded when they enter the camera view. Following Jung [15], we define the first case as EXPLICIT OCCLUSION and the second case as IMPLICIT OCCLUSION. The occlusion events are detected, and the objects after splitting are tracked and matched using the motion information.

After the objects are matched, the object trajectories are smoothed in the temporal filtering process. The smoothed trajectories are then input to the video abstraction algorithm to generate abstractions.

3.3.1 Matching Function

The object tracking algorithm tracks the object trajectories by matching the current video objects to the previously tracked video objects. In the object tracking literature, some algorithms [15-17] adopt Kalman filtering to estimate and track the objects. It is appealing because it recursively estimates the object states and updates the predictions. In the ideal situation, when the moving paths are very smooth and the object masks are very accurate, the prediction error converges quickly because both the measurement error and the process error are small. However, the detected object boundaries may contain errors due to cluttered scenes in real environments, and the measurement errors thus become large. In addition, the path of a moving object is not as smooth as expected. For example, if we connect the mass centers of a walking person, the connected path looks like a zigzag rather than a straight line, because actions such as waving of hands and striding affect the mass center significantly. When both the measurement error and the process error are high, the prediction error may not converge quickly. Thus, the uncertainty of the estimation makes it difficult to track objects and to handle complicated conditions such as object occlusions.

For these reasons, we use the motion information to match objects in our object tracking algorithm. We use a simple but effective motion distance function to compute the motion distance for object matching. Let us denote the ith segmented video object at time t as VO_t^i. Suppose there are n current video objects and m previous objects, which belong to m tracked trajectories. All m previous objects VO_{t-1}^j were tracked at time t-1, so their motion vectors are known at time t and are denoted as MV_{t-1}^j. In addition, the mass centers of the n current objects and the m previous objects are known and are denoted as XY_t^i and XY_{t-1}^j respectively. The motion vector distance function (MVD) is computed as shown in Eq. (10) and Eq. (11). In the equations, mv(i, j) is the motion vector between current object i and previous object j, and θ is the included angle between mv(i, j) and MV_{t-1}^j.

\[ mv(i,j) = XY_t^i - XY_{t-1}^j \tag{10} \]

\[ MVD(VO_t^i, VO_{t-1}^j) = \sqrt{ \left| MV_{t-1}^j \right|^2 + \left| mv(i,j) \right|^2 - 2 \cdot \left| MV_{t-1}^j \right| \cdot \left| mv(i,j) \right| \cdot \cos(\theta) } \tag{11} \]

The motion vector distance function takes both the position and the moving direction into account. Its geometric meaning, shown in Fig. 5, is the length of the difference between the two motion vectors: by the law of cosines, Eq. (11) equals |mv(i, j) - MV_{t-1}^j|.

Fig. 5. (a) video object j at time t-1 and its tracked motion vector MV_{t-1}^j; (b) video object i at time t, where mv(i, j) is computed using Eq. (10); (c) the motion vector distance, |mv(i, j) - MV_{t-1}^j| = MVD(VO_t^i, VO_{t-1}^j)

To match the n current objects to the m previous objects, the matching function first builds an n-by-m table and computes the motion vector distance for each table entry. The matching function then picks the entry whose motion vector distance is the minimum. If the minimum value does not exceed a predefined threshold THmatch, the corresponding current object and previous object in the selected entry are matched. Note that the matching is one-to-one: a matched current object and a matched previous object cannot be matched again. The matching function runs iteratively until the selected motion vector distance exceeds the threshold or either the current objects or the previous objects are all matched. Fig. 6 shows the pseudo code of the matching function.

    create an n by m MVD table T for the n current objects and m previous objects
    COsize = n; POsize = m;
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            T[i,j] = MVD(VO_t^i, VO_{t-1}^j);
        }
    }
    while (COsize > 0 && POsize > 0) {
        min_Value = the minimum MVD value in T;
        min_Entry = the entry T[x,y] that has the minimum MVD value;
        if (min_Value >= THmatch) break;
        match VO_t^x to VO_{t-1}^y;
        COsize--; POsize--;
        delete the row T[x,*] in T;
        delete the column T[*,y] in T;
    }
    end of matching algorithm;

Fig. 6. Pseudo code of the matching function
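As an illustration of Eq. (10), Eq. (11) and the greedy table-based matching of Fig. 6, the following Python sketch may help. It is a minimal version under stated assumptions: MVD is computed directly as the length of the difference vector (the geometric form shown in Fig. 5), objects are reduced to their mass centers and motion vectors, and the names `mvd`, `greedy_match` and `th_match` are hypothetical.

```python
import math


def mvd(curr_xy, prev_xy, prev_mv):
    """Motion vector distance: |mv(i, j) - MV_{t-1}^j|, with mv from Eq. (10)."""
    mv = (curr_xy[0] - prev_xy[0], curr_xy[1] - prev_xy[1])
    return math.hypot(mv[0] - prev_mv[0], mv[1] - prev_mv[1])


def greedy_match(curr, prev, th_match):
    """One-to-one greedy matching in the spirit of Fig. 6.

    curr: list of (x, y) mass centers of the current objects.
    prev: list of ((x, y), (mvx, mvy)) for the previous objects.
    Returns a list of (current_index, previous_index) pairs.
    """
    # Build the n-by-m table of motion vector distances.
    table = {(i, j): mvd(c, p_xy, p_mv)
             for i, c in enumerate(curr)
             for j, (p_xy, p_mv) in enumerate(prev)}
    matches = []
    while table:
        (i, j), best = min(table.items(), key=lambda kv: kv[1])
        if best >= th_match:
            break  # no remaining pair is a good enough match
        matches.append((i, j))
        # Delete row i and column j so neither object is matched again.
        table = {(a, b): d for (a, b), d in table.items()
                 if a != i and b != j}
    return matches
```

Current indices missing from the result correspond to the unmatched current objects, and previous indices missing from it to the unmatched previous objects, which the later sub-modules handle.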

3.3.2 Object States

Before describing the object matching algorithm, we must introduce the object states associated with the tracked objects. Because several object events can occur during tracking, such as appearance, disappearance, occlusion and split, we need additional flags to indicate the current state of an object. Therefore, we define two object state flags, OBJ_STATE and OCC_STATE.

OBJ_STATE indicates the condition of an object in its life cycle. It has three states: NORMAL, DYING and DEAD. Fig. 7 shows the state transition graph.

[Fig. 7 diagram. Transitions: new object appears -> NORMAL; NORMAL -> DYING when there is no matching object in the current frame; DYING -> NORMAL when the object matches an object in the current frame; DYING -> DEAD after being in the DYING state for more than p frames.]

Fig. 7. The state transition diagram of the OBJ_STATE

The life cycle of an object starts when the object first appears in the frame, and its state goes to NORMAL. When the scene is cluttered or the moving object is small, the object may be missed or filtered out in the segmentation process, so the moving object may disappear temporarily. In traditional approaches, the original object may be considered to have disappeared, and a new object entry is created under this condition. However, in human perception, there should be only one object. Therefore, instead of considering the temporarily disappeared object dead immediately, we let the object go to the DYING state first. When an object in the DYING state cannot find a match in the next p frames (say, three), we consider that the object has really disappeared and let it go to the DEAD state. Conversely, the object goes back to the NORMAL state when a good match is found before the time limit.
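The OBJ_STATE life cycle can be sketched as a small state machine. This is only an illustrative reading of Fig. 7, not the thesis code: the class name `TrackedObject`, the counter `frames_dying` and the per-frame `update(matched)` call are assumptions, and p defaults to the example value of three.

```python
from enum import Enum


class ObjState(Enum):
    NORMAL = "NORMAL"
    DYING = "DYING"
    DEAD = "DEAD"


class TrackedObject:
    """Hypothetical holder of the OBJ_STATE flag for one trajectory."""

    def __init__(self, p=3):
        self.state = ObjState.NORMAL  # a new object starts in NORMAL
        self.p = p                    # allowed number of frames in DYING
        self.frames_dying = 0

    def update(self, matched):
        """Advance the state once per frame; matched is True if a match was found."""
        if self.state is ObjState.DEAD:
            return self.state         # DEAD is terminal
        if matched:
            self.state = ObjState.NORMAL
            self.frames_dying = 0
        else:
            self.state = ObjState.DYING
            self.frames_dying += 1
            if self.frames_dying > self.p:
                self.state = ObjState.DEAD
        return self.state
```

Keeping the object in DYING for a few frames is what lets a briefly missed detection rejoin its original trajectory instead of starting a new one.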

The other state flag is OCC_STATE, which indicates the relationship of an object to other objects. OCC_STATE has three states: NORMAL, COLLISION and OCCLUSION. The state transition diagram is shown in Fig. 8.

[Fig. 8 diagram. Transitions: new object appears -> NORMAL; NORMAL -> COLLISION when it is possible to collide; COLLISION -> NORMAL when it is impossible to collide; COLLISION -> OCCLUSION when the objects occlude; OCCLUSION -> NORMAL on split.]

Fig. 8. The state transition diagram of the OCC_STATE

When a new object appears, its OCC_STATE goes to NORMAL. During the tracking process, each time an object is matched, our algorithm estimates the object positions to examine whether this object may collide with other objects in the next few frames. If a collision is possible, the object goes to the COLLISION state. When an object in the previous frame is left unmatched after the matching process, the COLLISION state can be used to judge whether an occlusion has occurred or the object has disappeared, since both cases lead to a failure in matching a previous object to a current object. If the object occludes other objects, it goes to the OCCLUSION state. On the other hand, if an object in the COLLISION state will no longer collide with any other object, it goes back to the NORMAL state.

Similarly, for a current object that fails to match any previous object, the OCCLUSION state can be used to judge whether explicitly occluded objects have split into two or a new object has appeared. Once the occluded objects split, the state goes back to NORMAL. However, in the case of implicit occlusion, the objects occlude outside the camera view and the occlusion event cannot be observed. Therefore, implicitly occluded objects are not in the OCCLUSION state, and the OCC_STATE remains unchanged when the objects split.

With these state flags, the tracking algorithm can handle complicated conditions without ambiguity. In the next section, we show how the object matching algorithm utilizes these states to match objects.
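The thesis does not spell out how the possibility of a collision is estimated, so the following Python sketch is only one plausible reading: it extrapolates both mass centers with their current motion vectors (a constant-velocity assumption) and flags a possible collision when the predicted centers come closer than a distance threshold. The function name `may_collide` and the parameters `horizon` and `min_dist` are hypothetical.

```python
import math


def may_collide(xy_a, mv_a, xy_b, mv_b, horizon=5, min_dist=20.0):
    """Return True if two objects may collide within `horizon` frames.

    Assumption: each object keeps its current motion vector, and a
    collision is "possible" when the predicted mass centers are closer
    than `min_dist` pixels.
    """
    for k in range(1, horizon + 1):
        ax, ay = xy_a[0] + k * mv_a[0], xy_a[1] + k * mv_a[1]
        bx, by = xy_b[0] + k * mv_b[0], xy_b[1] + k * mv_b[1]
        if math.hypot(ax - bx, ay - by) < min_dist:
            return True
    return False
```

An object for which `may_collide` holds against some other object would move from NORMAL to COLLISION, and back to NORMAL once it no longer holds.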

3.3.3 Objects Matching Algorithm

The object matching algorithm, a sub-module in Fig. 4, tries to find matches for the objects using the matching function. Because our algorithm handles various conditions, simply matching the objects detected in the current frame to the objects detected in the previous frame is not enough. In our matching algorithm, the process is divided into several stages. For short, the objects detected in the current frame and in the previous frame are denoted as CurrObj and PrevObj respectively, and the diagram is shown in Fig. 9.

[Fig. 9 diagram. Blocks: Previous Objects; Current Objects; Matching Function; Unmatched Current Objects; Unmatched Previous Objects; Split Objects Matching; Occluded Objects Matching; Estimated Objects Matching; Append Estimated Objects; Create new objects entry; Update; Object Trajectory; Refresh Trajectory. Decisions: number of unmatched current objects > 0?; number of unmatched previous objects > 0?]

Fig. 9. The process of the whole matching algorithm

In the matching algorithm, the objects detected in the current frame and the objects detected in the previous frame are taken into the matching function to find the best matches. For each object entry, we use an object trajectory to store the tracked objects in each frame.

So, the current objects that find matches here are appended to their respective object trajectories. Because the matching function terminates when no more good matches can be found, some current objects and previous objects may be left unmatched; they are stored in CurrObjRest and PrevObjRest respectively. Current objects may be left unmatched because new objects appear, occluded objects split, or temporarily disappeared objects reappear. Similarly, previous objects may be left unmatched because of the occlusion or the disappearance of objects. The sub-modules designed to handle these events and conditions are presented in detail below. The whole process of the matching algorithm is presented after these sub-modules.

3.3.3.1 Occluded Objects Matching

For the previous objects that are left unmatched, our matching algorithm first checks whether there are any object occlusion events. As mentioned earlier, since both occlusion and disappearance leave previous objects unmatched, the COLLISION state must be used to judge whether there is an occlusion event. Fig. 10 illustrates the relationship of occluded objects and Fig. 11 shows the diagram of the occluded objects matching process.

[Fig. 10 diagram: (a) VO_{t-1}^1 and VO_{t-1}^2, both in the COLLISION state; (b) the single detected object VO_t^1.]

Fig. 10. The relationship of occlusion objects; (a) before occlusion; (b) after occlusion

[Fig. 11 diagram. Blocks: Unmatched Previous Objects; Unmatched Previous Objects in COLLISION state; Current Objects that have matched to the Previous Objects in COLLISION state; Matching Function; Matched Objects; Update; Object Trajectory; Occlusion Group.]

Fig. 11. The occluded objects matching process

Assume that the video objects VO_{t-1}^1 and VO_{t-1}^2 at time t-1 may collide with each other in the future, so both objects are in the COLLISION state, and we say that these two objects are in the same 'collision group'. In addition, assume that the two objects occlude at time t and thus only one isolated object is detected. As described earlier, the current objects and the previous objects are first taken into the matching function, and thus one of the objects at time t-1, say VO_{t-1}^2, is matched to VO_t^1 at time t. After the one-to-one matching function, the video object VO_{t-1}^1 is left unmatched.

To handle the occlusion events, our algorithm first checks the possibility of object occlusion by examining the COLLISION state of the unmatched previous object, for example VO_{t-1}^1 in Fig. 10, and the previous object in the COLLISION state is taken into the matching function. Then, the current object that has been matched to another previous object in the same 'collision group' is also taken into the matching function. If a match is found, there is indeed an occlusion, since the current object can match both previous objects in the COLLISION state, which agrees with the real situation of object occlusion. Once the occlusion event is detected, both occluded objects go to the OCCLUSION state and they share the same object trajectory until they split into two. We say that these occluded objects are in the same 'occlusion group'. Note that, because the individual motion is required to track each object when the occluded objects split, our algorithm keeps estimating the individual trajectories while the objects are occluded. The tracked objects are appended to their respective object trajectories.

3.3.3.2 Split Objects Matching

The split objects matching process handles object split events and matches the split objects. The relationship of the split objects and the diagram of the process are shown in Fig. 12 and Fig. 13.

[Fig. 12 diagram: (a) VO_{t-1}^1 in the OCCLUSION state; (b) the two objects VO_t^1 and VO_t^2.]

Fig. 12. The relationship of split objects; (a) before splitting; (b) after splitting

[Fig. 13 diagram. Blocks: Unmatched Current Objects; Occlusion Objects; Previous Objects; Matching Function (two stages); Matched Objects; Update; Object Trajectory; Occlusion Group.]

Fig. 13. The split object matching process

Assume that the object VO_{t-1}^1 at time t-1 is merged from two objects, and that they split into two objects, VO_t^1 and VO_t^2, at time t. In the matching function performed on the current and previous objects, the previous object VO_{t-1}^1 is matched to one of the current objects, for example VO_t^2. Therefore, the current object VO_t^1 is left unmatched. Recall that there are explicit occlusions and implicit occlusions, so the conditions of a split event are more complex. Because we cannot judge the possibility of a split event from the OCCLUSION state alone, we divide the split object matching process into two steps.

In the first step, we try to detect split events from explicitly occluded objects. First, we find all the objects in the OCCLUSION state from the object trajectory lists and pick those that have not been matched to current objects. Take Fig. 12 for example: although there is only one previous object, there is another object trajectory in the OCCLUSION state. Then, the unmatched current objects and the picked objects are taken into the matching function. Note that the estimated motion of each individual object is used here. If a good match is found, some occluded objects have split, because the previously occluded objects now match two objects individually. In this case, the split objects go back to the NORMAL state and the occlusion group of the occluded objects is deleted. Then the tracked current objects are appended to their respective object trajectories. If the number of unmatched current objects is not zero, the second step is performed.

In the second step, we try to detect split events from implicitly occluded objects. The process is quite similar. However, instead of finding objects in the OCCLUSION state from the object trajectories, all the previous objects are used for matching, since implicitly occluded objects carry no OCCLUSION flag. If a good match is found, one previous object matches two current objects, which indicates a split event of an implicitly occluded object. In this case, our algorithm creates a new object trajectory for the object that splits out, and the tracked objects before splitting are duplicated.

3.3.3.3 Estimated Objects Matching

Because we do not consider an object dead immediately after it disappears, we append an estimated object to its object trajectory for later matching. We use the object information from the past few frames to predict the position and the motion of the estimated object. When the temporarily disappeared object reappears in the current frame, we must pick up the estimated objects for matching. Fig. 14 shows the situation in which an estimated object is used, and Fig. 15 shows the diagram of the estimated object matching process.
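The text does not specify how the estimated object is predicted from the past few frames. As a hedged sketch, one simple option is to average the frame-to-frame motion over the last k steps and extrapolate the mass center by one frame. The function name `estimate_object` and the parameter `k` are hypothetical.

```python
def estimate_object(centers, k=3):
    """Predict the next mass center and motion of a temporarily lost object.

    centers: list of (x, y) mass centers of the trajectory so far.
    Assumption: the average motion over the last k steps continues
    for one more frame.
    Returns (predicted_center, predicted_motion_vector).
    """
    recent = centers[-(k + 1):]       # at most k steps of history
    steps = len(recent) - 1
    if steps == 0:
        return centers[-1], (0.0, 0.0)  # no history: assume no motion
    mvx = (recent[-1][0] - recent[0][0]) / steps
    mvy = (recent[-1][1] - recent[0][1]) / steps
    x, y = centers[-1]
    return (x + mvx, y + mvy), (mvx, mvy)
```

The predicted center and motion vector can then be fed to the matching function in place of a real previous object.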

[Fig. 14 diagram: (a) VO_{t-2}^1; (b) the estimated object, marked '?'; (c) VO_t^1.]

Fig. 14. The condition in which an estimated object is appended: (a) time t-2; (b) time t-1, the object disappears and an estimated object is appended; (c) time t, the object reappears and is matched to the estimated object

[Fig. 15 diagram. Blocks: Unmatched Current Objects; Estimated Objects; Matching Function; Matched Objects; Update; Object Trajectory.]

Fig. 15. The estimated objects matching process

This matching process first finds all the object trajectories in the DYING state and picks the estimated objects from these trajectories. Then, these estimated objects and the unmatched current objects are taken into the matching function. If an estimated object matches an unmatched current object, the current object is appended to the object trajectory of that estimated object, and the OBJ_STATE goes back from DYING to NORMAL.

Now that all the matching sub-modules have been presented, we describe the whole process of the matching algorithm. The current objects and the previous objects are taken into the matching function, which terminates when no more good matches can be found. If there are events such as appearance, disappearance, splitting or occlusion of objects, some current objects and previous objects are left unmatched. As shown in Fig. 9, for the unmatched previous objects, the matching algorithm first performs the occluded objects matching process to check whether there are any object occlusion events and tries to find matches. If some previous objects still cannot find a match, we consider that these objects have disappeared. For the object trajectories of these unmatched previous objects, we set the OBJ_STATE to DYING and append estimated objects.

The unmatched current objects first go to the split objects matching process, which checks whether there are any object split events and tries to find matches. The current objects that still cannot find matches then go to the estimated objects matching process. After the estimated objects matching process, the remaining current objects are considered new objects, and new object trajectory entries are created for them.

Finally, the matching algorithm goes to the refresh trajectory function. In this function, the objects in the DYING state are examined first. If an object stays in the DYING state for too long, we consider that it has disappeared permanently and let it go to the DEAD state. After that, the motion vector of each moving object is recomputed using the newly tracked object position. Finally, based on the tracked object positions and the computed motion vectors, the function examines whether any two objects may collide with each other in the near future. Each time the matching algorithm finishes matching and processing all the objects, the tracking algorithm passes the objects to the temporal filtering process.

3.3.4 Temporal Filtering

The temporal filtering process is designed to filter out the ghost effect. Because a ghost usually appears and disappears very quickly, we can use temporal filtering to filter out ghost objects. In our algorithm, a detected and tracked object is not considered valid unless it survives for more than a given time period. In other words, an object that goes to the DEAD state too soon after it appears is filtered out and excluded from the key objects selection process in the video abstraction algorithm.

Another objective of the temporal filtering process is to smooth the object motion. In the trajectory of a non-rigid object such as a walking person, the connected path of the mass center fluctuates in a zigzag, which seriously affects the precision of the moving direction analysis. Therefore, the paths of the moving objects need to be smoothed in the temporal filtering process. Fig. 16 shows the zigzag-like moving path and the smoothed moving path. The solid lines represent the true motion vectors obtained by connecting the mass centers, and the direction of the dashed line represents the moving direction after temporal filtering.

Fig. 16. The path of the mass center of a walking person
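The smoothing filter itself is not specified in the text; Section 4.3 only reports a window size of 5. As a sketch under that assumption, the motion vector at each frame can be taken as the average of the frame-to-frame motion vectors over the last `window` steps, which is the displacement across the window divided by the number of steps. The function name `smooth_motion` is hypothetical.

```python
def smooth_motion(centers, window=5):
    """Smoothed motion vector per frame from a list of (x, y) mass centers.

    Assumption: a moving average over the last `window` frame-to-frame
    motion vectors, i.e. the net displacement over the window divided
    by the number of steps it spans.
    """
    smoothed = []
    for t in range(1, len(centers)):
        start = max(0, t - window)    # shorter window near the start
        steps = t - start
        dx = (centers[t][0] - centers[start][0]) / steps
        dy = (centers[t][1] - centers[start][1]) / steps
        smoothed.append((dx, dy))
    return smoothed
```

For a zigzag path whose vertical offset alternates from frame to frame, the averaged vector points much closer to the overall walking direction than the raw frame-to-frame vector does.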

3.4 Video Abstraction Algorithm

The last module in the system is the video object abstraction module. The video abstraction algorithm generates the abstraction by selecting the key frames with meaningful semantics. Because moving objects are the most important parts of surveillance videos, selecting important key frames is equivalent to selecting important key objects. Therefore, our video abstraction algorithm analyzes the tracked object trajectories and detects object events to extract representative key objects.

Although the best and most representative key objects of an object trajectory can be selected after the life cycle of that object has ended, this approach is not suitable for a real-time tracking system like ours. To achieve on-line alarming in a real-time tracking system, the key objects must be selected in near real time, which means the delay must be bounded and very small. Therefore, every time a new frame comes in, our algorithm examines the currently tracked object in each trajectory and selects it as a new key object if it is representative enough for its trajectory.

One of the criteria for key object selection is based on object events, which are representative of object states or object relationships at some time instant. Such events may attract human interest. Some object events are important in general domains, such as appearance and disappearance. In addition, the motions and the positions of the objects may also be used as selection criteria. For example, we pay more attention when a new object appears or when the moving direction of an object changes, because these can represent significant events. Therefore, the object trajectories must be analyzed to extract this specific information.

The diagram of the abstraction algorithm is shown in Fig. 17. There are three modules in this algorithm. The algorithm takes the object trajectories generated by the video object tracking algorithm as input. The abstraction is generated by selecting the frames with key objects and is output to clients.

[Fig. 17 diagram. Blocks: Object Trajectories; Object State Analysis; Object Trajectory Analysis; Selected Major and Minor Key Objects.]

Fig. 17. The abstraction algorithm

3.4.1 Object State Analysis

The object state analysis process detects general object events such as the appearance, disappearance, occlusion and split of objects. Because these events have already been detected and handled for object matching in the tracking algorithm, we can capture them directly by examining the state transitions of the OBJ_STATE and the OCC_STATE of the objects. The only exception is that we do not extract the event immediately when an object appears, because temporal filtering is applied to filter out ghosts. Therefore, an object appearance event is captured only after the object has survived for a period of time.

3.4.2 Object Trajectory Analysis

The object trajectory analysis process analyzes the trajectories to find featured objects as key objects. The featured objects are representative of changes in the moving speed and direction, the position in the frame view, or the object size. Every time a new object is tracked, our algorithm compares the motion, position and size of that object to those of the previously selected key object. To evaluate the motion difference between the current object and the previously selected key object, the motion vector distance function described in Section 3.3.1 is used. However, to prevent the zigzag-like paths of non-rigid objects from affecting the analysis of the motion direction, the motion vector after temporal filtering is used. Fig. 18 shows the analysis process.

[Fig. 18 diagram: the motion vector distance between the MV of the previously selected key object and the MV of the currently tracked object is compared with THm; if it is larger, the object is selected as a major key object. The position distance is compared with THp; if it is larger, the object is selected as a minor key object. The selected key objects are output.]

Fig. 18. The trajectory analysis and key object selection process

3.4.3 Video Abstraction with Selected Key Objects

After the object event detection process and the object trajectory analysis process, the abstraction can be output using the selected key objects. In our algorithm, we define two types of key objects: major key objects and minor key objects. The major key objects represent important events and are always exported. In contrast, the minor key objects are less important and are exported only when no other key object has been exported recently. All the key objects selected in the object state analysis process are major key objects. In addition, the objects whose motion changes significantly are also selected as major key objects. The key objects that are selected using the position as the criterion are minor key objects. Fig. 19 shows how the algorithm selects the key objects to export.

[Fig. 19 diagram: if there is any major key object, it is exported to the client; otherwise, if there is any minor key object, it is exported to the client only if no key object was exported recently.]

Fig. 19. The key objects exporting process
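The selection rule of Fig. 18 and the export rule of Fig. 19 can be summarized in a short Python sketch. It is an illustration under assumptions rather than the thesis code: the motion vector distance is taken as the length of the difference between the two smoothed motion vectors, "recently" is modeled as a frame count, and the names `classify_key_object`, `should_export`, `th_m`, `th_p` and `quiet_frames` are hypothetical.

```python
import math


def classify_key_object(curr_mv, key_mv, curr_xy, key_xy, th_m, th_p):
    """Compare the current object with the previously selected key object.

    Returns "major" if the motion changed by more than th_m, "minor" if
    only the position moved by more than th_p, otherwise None.
    """
    motion_dist = math.hypot(curr_mv[0] - key_mv[0], curr_mv[1] - key_mv[1])
    if motion_dist > th_m:
        return "major"
    position_dist = math.hypot(curr_xy[0] - key_xy[0], curr_xy[1] - key_xy[1])
    if position_dist > th_p:
        return "minor"
    return None


def should_export(kind, frames_since_last_export, quiet_frames=30):
    """Major key objects are always exported; minor ones only after a quiet period."""
    if kind == "major":
        return True
    if kind == "minor":
        return frames_since_last_export > quiet_frames
    return False
```

Events detected by the object state analysis (appearance, disappearance, occlusion, split) would bypass `classify_key_object` and be treated as "major" directly.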

Chapter 4 System Architecture and Experiment Result

In this chapter, we present the system for object-based video tracking and abstraction. Section 4.1 gives an overview of the system architecture. Sections 4.2 to 4.4 present the experimental results of each module. The implemented system is presented in Section 4.5.

4.1 System Architecture Overview

In this thesis, we implemented an object-based tracking and abstraction system for surveillance videos. The raw video data captured live are input to our system, and object segmentation, object tracking and video abstraction are performed on the fly. The abstraction is used for on-line alarming at the client, while the surveillance video can be monitored simultaneously. Except for some predefined thresholds, all initializations are done automatically without manual interaction. Fig. 20 shows the overview of the system.

[Fig. 20 diagram. Blocks: Surveillance Video; Video Object Segmentation; Video Object Tracking; Video Abstraction; Client (Monitoring Video, Alarming Abstraction).]

Fig. 20. System architecture overview

4.2 Experimental Results of the Video Object Segmentation

In the segmentation algorithm, the first 40 frames are used for background initialization before segmentation starts. In the morphological operations, the structuring element is 7 by 7 for the closing operation and 5 by 5 for the opening operation. The size threshold used for size filtering is 450 pixels.

Fig. 21 presents the segmentation results of the ETRI_B clip at frame #95. Fig. 21(a) shows the original image and Fig. 21(b) shows the result of using the luminance and chrominance together to segment video objects. For reference, Fig. 21(c) shows the segmentation result when only the luminance channel is used. The results show that combining the luminance and the chrominance improves the segmentation considerably.

Fig. 21. (a) original image; (b) segmentation result when luminance and chrominance channels are used; (c) segmentation result when only the luminance channel is used

Fig. 22 shows the result of the clip "hall monitor" after applying the morphological operations. The small noise regions are removed and the black holes inside the objects are filled. The bigger noise regions left in Fig. 22(b) are filtered out in the size filtering process.

Fig. 22. (a) the segmentation result before the morphological operation; (b) the result after the morphological operation

Fig. 23 through Fig. 27 show some results of the segmentation algorithm on the tested video sequences.

Fig. 23. Segmentation results of the clip speedway at frames (a) #585; (b) #673

Fig. 24. Segmentation results of the clip hall monitor at frames (a) #114; (b) #273

Fig. 25. Segmentation results of the clip ETRI_A at frames (a) #95; (b) #120; (c) #360; (d) #470; (e) #567; (f) #617

Fig. 26. Segmentation results of the clip ETRI_B at frames (a) #168; (b) #378; (c) #1045; (d) #1267; (e) #1477; (f) #1969; (g) #2511; (h) #2753

Fig. 27. Segmentation results of the clip ETRI_C at frames (a) #95; (b) #120; (c) #360; (d) #470; (e) #567; (f) #617

(54) The results show that most of the noise regions are successfully filtered out. However, the ghost regions in Fig. 24(b) and Fig. 26(b) are not removed because their size exceeds the filtering threshold. In the video clip of Fig. 26, the sunlight varies dramatically, so the color of the large regions that are directly illuminated changes significantly. For this reason, many of the objects detected in this video clip are false alarms. The other false-alarm region in Fig. 24(b) results from a stopped object. Although the person wearing white pants in Fig. 24(a) does not completely stop, the motion in both the x and y directions is almost zero, and thus the color of the object region is updated into the background. Because the updated background color is distorted and differs from the real background color, the region is reported as a false alarm.

Table 1 gives the statistics of the segmentation results. Four video clips with different environments and contents are tested. The first column is the total number of ground truth objects over all the frames of the video clip; the moving regions that can be clearly distinguished are selected as ground truth. The second column is the total number of objects detected after the morphological operation and size filtering. The precision and the recall are defined in Eq. (12) and Eq. (13), respectively.

Table 1. Statistics of segmentation result.

| Sequence name | Ground Truth | Detected | False alarm | Miss | Precision | Recall |
|---|---|---|---|---|---|---|
| Hall Monitor | 463 | 487 | 25 | 1 | 94.86% | 99.78% |
| ETRI_A | 1365 | 1401 | 39 | 3 | 97.21% | 99.78% |
| ETRI_C | 526 | 513 | 6 | 19 | 98.83% | 96.38% |
| Speedway | 410 | 354 | 0 | 56 | 100% | 86.34% |

$$\mathrm{precision} = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{false\_alarm}} \tag{12}$$

$$\mathrm{recall} = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{miss}} \tag{13}$$

(55) The false alarms are mainly due to stopped objects and lighting variation. The recall rate drops in the speedway sequence because the vehicles are very small when they are far away. These vehicles are detected after the morphological operation but are then removed by the size filter. However, these filtered-out small objects do not affect the result of video abstraction, since they are too small and contain little semantic content.
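As a quick check of Eq. (12) and Eq. (13), the following minimal sketch recomputes the precision and recall columns from the counts in Table 1. It assumes that hits can be taken as the detected count minus the false alarms, which is consistent with every row (hits + miss equals the ground truth). The function names are illustrative only and are not part of the thesis implementation. The percentages are truncated rather than rounded to two decimals, since that appears to be how the table values were produced.

```python
# Counts copied from Table 1: (ground truth, detected, false alarm, miss).
TABLE_1 = {
    "Hall Monitor": (463, 487, 25, 1),
    "ETRI_A": (1365, 1401, 39, 3),
    "ETRI_C": (526, 513, 6, 19),
    "Speedway": (410, 354, 0, 56),
}


def truncated_percent(numerator, denominator):
    """Percentage truncated (not rounded) to two decimals, via integer math."""
    return (numerator * 10000 // denominator) / 100


def precision_recall(ground_truth, detected, false_alarm, miss):
    """Return (precision, recall) in percent, following Eq. (12) and Eq. (13)."""
    # Assumption: a hit is a detected object that is not a false alarm.
    hits = detected - false_alarm
    if hits + miss != ground_truth:
        raise ValueError("counts are inconsistent with the ground truth")
    precision = truncated_percent(hits, hits + false_alarm)  # Eq. (12)
    recall = truncated_percent(hits, hits + miss)            # Eq. (13)
    return precision, recall


for name, counts in TABLE_1.items():
    p, r = precision_recall(*counts)
    print(f"{name}: precision {p:.2f}%, recall {r:.2f}%")
```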

(56) 4.3 Experiment Results of the Video Object Tracking

In the tracking algorithm, the threshold THmatch used in the matching function is heuristically set to 50. The window size used in the temporal filtering is set to 5. Fig. 28 through Fig. 30 show the results of the video object tracking algorithm. To make the results easy to check, objects that belong to the same trajectory are manually marked with an identifying label.

Fig. 28. Tracking results of the speedway sequence at frame (a) #530; (b) #545; (c) #560; (d) #575; (e) #590; (f) #605.

(57) Fig. 29. Tracking results of the ETRI_C sequence at frame (a) #420; (b) #450; (c) #475; (d) #478; (e) #486; (f) #491; (g) #494; (h) #505; (i) #540; (j) #570; (k) #576; (l) #586; (m) #589; (n) #605.

(59) Fig. 30. Tracking results of the ETRI_B sequence at frame (a) #2190; (b) #2227; (c) #2500; (d) #2508; (e) #2518; (f) #2513.

The results show that the tracking algorithm successfully tracks the video objects and detects the occlusion events. Our algorithm also correctly matches the split objects to the objects before occlusion, as shown in Fig. 29. In addition, objects moving at different speeds, for example the person who walks slowly (obj 1) and the person who runs quickly (obj 3) in Fig. 30, are all successfully tracked. Table 2 shows the statistics of detecting and tracking the occlusion and split events. The results show that all the objects before and after the occlusions are matched correctly. The only failure in detecting an occlusion event happens in the sequence ETRI_B, as shown in Fig. 30(a) and Fig. 30(b). Because object 2 is already occluded by object 1 in the first frame in which it enters the camera view, the occlusion cannot be detected under this condition.

Table 2. Statistics of the tracking and detecting of occlusion events.

| Sequence name | Number of occlusion events occurred | Number of occlusion events detected | Number of the matching failures after the split |
|---|---|---|---|
| ETRI_A | 3 | 3 | 0 |
| ETRI_B | 4 | 3 | 0 |
| ETRI_C | 2 | 2 | 0 |

(60) The statistics in Table 3 show the result of the tracking algorithm. Because the sequence “ETRI_C” is too long, we only took the first 3000 frames. Note that when a person walks behind the tree, we count the trajectory before he is covered and the trajectory after he is uncovered as two different trajectories, since the object does disappear from view for a while. We can see that many ghost regions are filtered out by the temporal filter. The remaining false alarms are mainly due to the stopped object effect and to lighting that keeps changing severely. The only missed object is the one that is already occluded in the first frame in which it appears, so it cannot be detected.

Table 3. Statistics of the tracked trajectories.

| Sequence name | Ground truth | Tracked | Tracked after temporal filtering | False alarmed trajectory | Missed |
|---|---|---|---|---|---|
| speedway | 6 | 6 | 6 | 0 | 0 |
| Hall monitor | 2 | 7 | 3 | 1 | 0 |
| ETRI_A | 10 | 23 | 15 | 5 | 0 |
| ETRI_B | 16 | 38 | 22 | 7 | 1 |
| ETRI_C | 22 | 37 | 24 | 2 | 0 |
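The columns of Table 3 can be cross-checked with a short script. The sketch below assumes that every trajectory remaining after temporal filtering is either a correctly tracked ground-truth trajectory or a false alarm, so that the filtered count should equal ground truth minus missed plus false alarms. The helper names are hypothetical and only serve this illustration. It also reports how many tracked trajectories the temporal filter removed in each sequence.

```python
# Rows copied from Table 3:
# (ground truth, tracked, tracked after temporal filtering, false alarmed, missed)
TABLE_3 = {
    "speedway": (6, 6, 6, 0, 0),
    "Hall monitor": (2, 7, 3, 1, 0),
    "ETRI_A": (10, 23, 15, 5, 0),
    "ETRI_B": (16, 38, 22, 7, 1),
    "ETRI_C": (22, 37, 24, 2, 0),
}


def removed_by_filter(row):
    """Number of tracked trajectories discarded by the temporal filter."""
    _, tracked, filtered, _, _ = row
    return tracked - filtered


def is_consistent(row):
    """Assumed identity: filtered = ground truth - missed + false alarms."""
    ground_truth, _, filtered, false_alarmed, missed = row
    return filtered == ground_truth - missed + false_alarmed


for name, row in TABLE_3.items():
    print(f"{name}: removed {removed_by_filter(row)}, consistent {is_consistent(row)}")
```

Under this assumption, all five rows are consistent, which supports reading the "False alarmed trajectory" and "Missed" columns as counts taken after temporal filtering.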

(61) The results show that our algorithm robustly tracks almost all the trajectories and reasons about the occlusion and split events. Although some false alarms remain, many of the ghost regions caused by the lighting effect and the stopped object effect are filtered out effectively. These robust tracking results can be used later to extract key objects for abstraction.

(62) 4.4 Experiment Results of the Video Abstraction

In the abstraction algorithm, the threshold THm, used by the MVD function to decide whether there is a significant change in motion, is set to 2 pixels. The threshold THp, used to decide whether the spatial distance is large enough, is set to 80 pixels. The interval for selecting minor key objects is 60 frames for a video sequence at 30 fps. Fig. 31 and Fig. 32 show the selected key objects for the detected occlusion and split events. The objects in consecutive frames are listed, and the key objects selected for the specific event are marked with a rectangle.

Fig. 31. Selected key objects for the detected occlusion event.

Fig. 32. Selected key objects for the detected split event.

Fig. 33 shows the selected key objects for the 33rd object in the sequence “ETRI_B”. The person first walks into the frame (a) and slightly changes direction (b). After a period of time, because the distance between the object positions in (b) and (c) is large enough, the object in (c) is also selected as a key object. After a while, he starts to rush, and key objects are selected in (d) and (e). Finally, the object in (f) is disappearing and is selected as a key object. Fig. 34 shows the 30th object in the sequence “ETRI_B”. Because the person keeps jumping in the camera view and the movements are large, the object is repeatedly selected as a key object. Fig. 35 through Fig. 37 show parts of the generated abstractions of the video sequences “ETRI_C”, “hall monitor” and “speedway”.

(63) Fig. 33. Selected key objects for the ETRI_B sequence (a) object appears; (b) change in motion; (c) change in position; (d) change in motion; (e) change in motion; (f) object is disappearing.

Fig. 34. Selected key objects for the ETRI_B sequence (a) object appears; (b) change in motion; (c) change in motion; (d) change in motion; (e) change in motion; (f) change in motion; (g) object is disappearing.
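The selection rules illustrated in Fig. 33 and Fig. 34 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the MVD function is assumed here to be the Euclidean norm of the difference between consecutive motion vectors, the position test is assumed to compare against the most recently selected key object, and the names `Observation` and `select_key_objects` are hypothetical. Occlusion and split events and the 60-frame interval for minor key objects are left out.

```python
from dataclasses import dataclass
from math import hypot

TH_M = 2.0   # pixels, threshold on the change in motion (value from the text)
TH_P = 80.0  # pixels, threshold on the spatial distance (value from the text)


@dataclass(frozen=True)
class Observation:
    frame: int
    x: float  # object position in pixels
    y: float


def select_key_objects(track, th_m=TH_M, th_p=TH_P):
    """Return (frame, reason) pairs for one trajectory, sorted by frame."""
    if not track:
        return []
    keys = [(track[0].frame, "object appears")]
    last_key = track[0]
    prev_motion = None
    for i in range(1, len(track)):
        prev, cur = track[i - 1], track[i]
        if i == len(track) - 1:
            # The final observation of the trajectory is always kept.
            keys.append((cur.frame, "object is disappearing"))
            break
        motion = (cur.x - prev.x, cur.y - prev.y)
        reason = None
        if prev_motion is not None:
            # Assumed MVD: norm of the difference of consecutive motion vectors.
            mvd = hypot(motion[0] - prev_motion[0], motion[1] - prev_motion[1])
            if mvd > th_m:
                reason = "change in motion"
        if reason is None and hypot(cur.x - last_key.x, cur.y - last_key.y) > th_p:
            reason = "change in position"
        if reason is not None:
            keys.append((cur.frame, reason))
            last_key = cur
        prev_motion = motion
    return keys


# A synthetic object that walks slowly and then speeds up at frame 4.
walk = [Observation(f, x, 0.0) for f, x in enumerate([0, 1, 2, 3, 8, 13, 18])]
print(select_key_objects(walk))
```

In this synthetic example the speed-up at frame 4 changes the motion vector by 4 pixels, which exceeds THm, so that frame is selected in addition to the first and last observations.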

(64) Fig. 35. Parts of the abstraction of the ETRI_C sequence (a) object appears; (b) change in position; (c) object appears; (d) change in position; (e) occlusion event; (f) change in motion; (g) split event; (h) object is disappearing; (i) change in position; (j) change in motion; (k) object appears; (l) change in position; (m) change in motion; (n) change in motion; (o) occlusion event.

(65) Fig. 36. Parts of the abstraction of the hall monitor sequence (a) object appears; (b) change in motion; (c) change in motion; (d) change in motion; (e) object appears; (f) object is disappearing; (g) change in motion; (h) change in motion; (i) object is disappearing.

(66) Fig. 37. Parts of the abstraction of the speedway sequence (a) object appears; (b) object appears; (c) change in motion; (d) change in motion; (e) change in motion; (f) change in motion; (g) change in motion; (h) change in motion; (i) change in motion.

(67) Table 4 shows the statistics of the generated abstractions. The results show that the generated abstractions are very compact and that the object-level semantics and events are still represented in them. In the next section we will show how to integrate the abstraction algorithm to provide on-line alarming.

Table 4. Statistics of the abstraction.

| Object abstraction | Selected key VOPs | Total VOPs |
|---|---|---|
| Object 1 (ETRI_A) | 9 | 240 |
| Object 7 (ETRI_A) | 7 | 295 |
| Object 3 (ETRI_B) | 10 | 140 |
| Object 30 (ETRI_B) | 7 | 59 |
| Object 33 (ETRI_B) | 6 | 167 |
| Object 5 (ETRI_C) | 7 | 176 |
| Object 6 (ETRI_C) | 15 | 171 |
| Object 3 (speedway) | 6 | 128 |
| Object 1 (hall monitor) | 5 | 235 |
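To make the compactness in Table 4 concrete, the following small sketch computes, for each object, the fraction of its VOPs that are kept as key VOPs. The ratio is not a measure defined in the thesis; it is introduced here only as one plausible way to read the table, and the function name `key_fraction` is illustrative.

```python
# Rows copied from Table 4: (object, selected key VOPs, total VOPs).
TABLE_4 = [
    ("Object 1 (ETRI_A)", 9, 240),
    ("Object 7 (ETRI_A)", 7, 295),
    ("Object 3 (ETRI_B)", 10, 140),
    ("Object 30 (ETRI_B)", 7, 59),
    ("Object 33 (ETRI_B)", 6, 167),
    ("Object 5 (ETRI_C)", 7, 176),
    ("Object 6 (ETRI_C)", 15, 171),
    ("Object 3 (speedway)", 6, 128),
    ("Object 1 (hall monitor)", 5, 235),
]


def key_fraction(selected, total):
    """Fraction of an object's VOPs that are kept in the abstraction."""
    return selected / total


total_selected = sum(selected for _, selected, _ in TABLE_4)
total_vops = sum(total for _, _, total in TABLE_4)

for name, selected, total in TABLE_4:
    print(f"{name}: {100 * key_fraction(selected, total):.2f}% of VOPs kept")
print(f"Overall: {total_selected} of {total_vops} VOPs kept")
```

The object with the highest kept fraction is Object 30 (ETRI_B), the jumping person of Fig. 34, which matches the observation that frequent changes in motion produce more key objects.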