
Chung-Hua University Master's Thesis

Title: Pedestrian Tracking and Trajectory Classification in Crowds

Tracking and Trajectory Classification in Crowd

Department: Master's Program, Department of Computer Science and Information Engineering
Student ID and Name: M09602004 Wei-Hsin Kan
Advisor: Dr. Cheng-Chang Lien

February 2010


Abstract

In this study, in order to detect individual targets in crowded scenes and analyze the trajectory information produced by crowd motion, we propose two methods: one detects and tracks individual targets in crowded environments, and the other classifies crowd motion trajectories. The first proposed method is a coarse-to-fine individual-target segmentation algorithm based on corner feature points: the C-Means algorithm first clusters the feature points roughly, a spatial-temporal shortest spanning tree then segments more precise individual targets within each moving cluster, and finally the concept of corner-point inheritance is used to track all moving targets. The second proposed method applies the longest common subsequences algorithm to the experimental environment to automatically estimate the similarity between motion trajectories; every feature-point trajectory is classified into the most appropriate category through spatial and temporal similarity measures. The experimental results show that the accuracy of segmenting individual moving targets in crowded environments can exceed 90%, and the system can process about 6 to 8 frames per second.


Acknowledgments

I am very fortunate to have had the opportunity to study in the Master's Program of the Department of Computer Science and Information Engineering at Chung-Hua University, to spend a memorable period of research life, and to earn my master's degree. First, I thank my advisor, Professor Cheng-Chang Lien, who over these two and a half years not only helped and taught me in professional knowledge but also taught me a great deal about attitudes toward life and dealing with people. It is thanks to Professor Lien's patient, step-by-step cultivation over these two-plus years that I was able to complete my graduate studies.

Next, I thank the many companions in the Intelligent Multimedia Lab with whom I spent my days. I especially thank the senior students 揚凱, 正達, 建程, 靈逸, 清乾, 昭偉, 懷三, 岳珉, 佐民, and 雅麟 for their great guidance and help with my thesis research, my classmates 正崙 and 明修 for our mutual support and care, and the junior students 信吉, 琮瑋, 雅婷, and 佩蓉 for their assistance and companionship, all of which made my two years of graduate life colorful.

Finally, I dedicate this thesis to my parents, Mr. 甘水添 and Ms. 鄭碧芬, who worked hard to raise me, and to my dear sister 甘芳慈. Thank you for your support and encouragement throughout my graduate studies, and thank you to the good friends who have always been by my side for everything you have given me.


Contents

Abstract
Acknowledgments
Contents
Chapter 1. Introduction
Chapter 2. Feature Point Extraction and Tracking in Video Sequences
Chapter 3. Pedestrian Tracking in Crowds
Chapter 4. Trajectory Classification in Crowds
Chapter 5. Experimental Results
Chapter 6. Conclusion
English Appendix


Chapter 1. Introduction

With the rapid development of computer technology, the demand for video-based applications such as video classification [1], video searching [2], video retrieval [3], and video surveillance [4] keeps growing. The New York Times Magazine [5] reported that a total of about 1,800 cameras are installed around the train stations of London. The main research direction of this thesis is therefore the development of new video surveillance techniques.

In conventional blob-based object detection systems, many methods are applied to extract moving objects, such as background subtraction [6], optical flow [7, 20-21], frame difference analysis [8], and the codebook model [9]. However, background subtraction is overly sensitive to illumination changes and prone to false detections; optical flow independently tracks low-level features on objects; frame difference analysis tends to extract moving objects incompletely when the background changes; and the codebook model is unsuitable for crowded surveillance environments.

Conventional target tracking systems [10, 11] are often limited to simple environments. For example, the ellipsoid human model proposed by Zhao and Nevatia [12, 13] is not applicable when an object in a crowded scene is only barely visible. The object tracking system proposed by Liangfu et al. [14] modifies the traditional mean-shift algorithm and yields better tracking results.

One of the greatest challenges in general video surveillance applications is tracking targets in crowded environments. Severe occlusion causes conventional blob-based object detection and tracking methods to fail, mainly for two reasons: the visible area of a monitored moving target can be very small, and frequent occlusion makes it difficult to segment individuals. Using point-based features avoids these problems: even when occlusion occurs in a crowded scene, some of a target's appearance information may still be preserved, so the target does not completely disappear from the frame.

In order to detect individual targets in crowded scenes and analyze the trajectory information of crowd motion, we propose the following two methods. The first is trajectory classification in crowds: as shown in the left half of the system flowchart in Fig. 1, corner feature points are tracked continuously until they vanish; we apply the LCSS algorithm [15] to compare the similarity between each feature-point trajectory and the predefined trajectories, and classify all feature-point trajectories. The second is pedestrian tracking in crowds: as shown in the right half of the system flowchart in Fig. 1, we propose a coarse-to-fine individual-target segmentation algorithm based on corner feature points, in which the C-Means algorithm produces coarse clusters, a spatial-temporal shortest spanning tree is constructed within each cluster to refine the individual moving targets, and finally the concept of feature-point inheritance is used to build the target tracking mechanism.

The rest of this thesis is organized as follows. Chapter 2 describes how feature points are extracted and tracked in video; Chapter 3 describes the pedestrian tracking system in crowds; Chapter 4 describes the trajectory classification system in crowds; Chapter 5 presents the experimental results; and Chapter 6 concludes.

(7)

[System flowchart figure. Common stages: Video Sequence → Feature Points Detection → Feature Points Tracking → Displacement Length Filter (static feature points under the threshold are discarded; dynamic feature points over the threshold are kept). Left branch: all tracks are compared with the preset tracks by the LCSS algorithm and the output is displayed according to the trajectory type. Right branch: C-means clustering → construct a spanning tree in all clusters → individual segmentation with the spatial-temporal relationship → tracking result trajectories / cluster result.]

Fig. 1 System flowchart. The left side describes the method of classification of crowd trajectories; the right side describes the method of tracking in crowds.


Chapter 2. Feature Point Extraction and Tracking in Video Sequences

This chapter explains the advantages of point-based features, which must withstand image variation, rotation, and scaling; the quality of the extracted feature points affects the target tracking mechanism. Section 2.1 describes why point-based features are used and introduces the feature points suitable for our system. Section 2.2 explains how corner feature points are extracted from an image. Section 2.3 describes how corner feature points are tracked across consecutive frames.


Chapter 3. Pedestrian Tracking in Crowds

This chapter describes in detail how individual targets are segmented and tracked in crowded environments. Section 3.1 describes how the trajectory information produced by corner-point tracking is recorded. Section 3.2 describes how the C-Means algorithm roughly clusters the dynamic feature points. Section 3.3 describes constructing a spatial-temporal shortest spanning tree within each coarse cluster and filtering the points in each cluster by spatial and temporal trajectory consistency to obtain more accurate individual-target segmentation. Section 3.4 notes that targets crossing each other may cause feature points within a cluster to be inherited incorrectly, so a voting mechanism is adopted to filter out points whose moving directions differ from the cluster's. Section 3.5 uses the continuous-tracking property of corner points to derive the concept of feature-point inheritance across consecutive frames and to build a highly stable individual-target tracking mechanism.


Chapter 4. Trajectory Classification in Crowds

In this chapter, we apply the longest common subsequences algorithm [15] to automatically analyze crowd motion trajectories. Section 4.1 covers the preprocessing of the algorithm, describing how the trajectory information produced by corner-point tracking is recorded. Section 4.2 details the LCSS algorithm, which analyzes all feature-point trajectories and compares their spatial and temporal similarity with the predefined trajectories to achieve trajectory classification.


Chapter 5. Experimental Results

In this chapter, the two systems are tested in different crowded environments. The experimental results of the pedestrian tracking system are compared with two other methods, and the processing steps of each stage are illustrated. The results show that the accuracy of segmenting individual moving targets in crowded environments can exceed 90%. For the trajectory classification system, the test environment is a pedestrian walkway on the Chung-Hua University campus; feature-point trajectories are extracted in the crowded environment and compared for similarity, and arrows of different colors indicate the different trajectory classes.


Chapter 6. Conclusion

This thesis proposes two intelligent analysis systems for crowded environments based on a feature tracking mechanism. In the pedestrian tracking subsystem, the proposed coarse-to-fine object segmentation refines the rough clustering results into more complete target information; the spatial-temporal relationship and the feature-point-inheritance tracking mechanism alleviate several problems commonly encountered in conventional video surveillance. First, when people walk in crowded environments, targets cannot be detected accurately. Second, tracking individual targets is difficult in crowded environments with severe occlusion. The experimental results show that our system segments individual moving targets in crowded environments with an accuracy above 90% and processes about 6 to 8 frames per second. In the trajectory classification subsystem, we apply the longest common subsequences matching algorithm [15] to a pedestrian walkway on the Chung-Hua University campus; the generated motion directions effectively represent crowd movement, and the system is both practical and promising. As future work on the pedestrian tracking subsystem, we will improve the accuracy of target segmentation and tracking and try to make the system applicable to more scenes.


English Appendix


Tracking and Trajectory Classification in Crowd

Prepared by Wei-Hsin Kan
Directed by Dr. Cheng-Chang Lien

Computer Science and Information Engineering Chung-Hua University

Hsin-Chu, Taiwan, R.O.C.

February, 2010


Abstract

In order to detect each individual target in crowded scenes and analyze the crowd moving trajectories, we propose two methods: one detects and tracks the individual targets in the crowd, and the other classifies the crowd motion trajectories. First, a coarse-to-fine individual segmentation approach based on corner point extraction and tracking is proposed. The dynamic feature points are roughly clustered by the C-Means algorithm, a spatial-temporal shortest spanning tree then segments each individual target within each moving group, and each target is tracked with the concept of point inheritance. Second, the longest common subsequences method is applied to automatically evaluate the similarities among the feature motion trajectories, and the feature tracks are classified by a similarity measure on both the temporal and spatial relationships. The experimental results show that the accuracy of individual segmentation in the crowd can be higher than 90%, and our system can process about 6 to 8 frames per second.


Contents

Abstract
Contents
Chapter 1. Introduction
Chapter 2. Feature Points Extraction and Tracking in Video Sequence
 2.1 Feature Extraction
 2.2 Corner Features
 2.3 Feature Tracking by KLT Tracker
Chapter 3. Tracking in Crowd
 3.1 Point-Based Tracking
 3.2 Rough Segmentation with C-Means Algorithm
 3.3 Individual Segmentation with Spatial-Temporal Shortest Spanning Tree
 3.4 Voting Method for Trajectory Conformance
 3.5 Object Tracking
Chapter 4. Classification of Crowd Trajectories
 4.1 Preprocess
 4.2 Longest Common Subsequences Algorithm
Chapter 5. Experiment Results
 5.1 Tracking in Crowd
  5.1.1 Test Videos
  5.1.2 Corner Points Detection on the Moving Object
  5.1.3 Construction of Spatial-Temporal Shortest Spanning Tree
  5.1.4 Segmentation with the Spatial-Temporal Shortest Spanning Tree
  5.1.5 Object Tracking
  5.1.6 Accuracy Analyses
 5.2 Classification of Crowd Trajectories
  5.2.1 Test Videos
  5.2.2 Classification of the Points' Trajectories
Chapter 6. Conclusion
Chapter 7. References


Chapter 1. Introduction

With the rapid progress of computing, some video-based applications, e.g., video classification [1], video searching [2], video indexing [3], and video surveillance [4], have become highly demanded. In particular, The New York Times Magazine [5] reported that a total of 1,800 cameras are installed in the train stations of London, and about 4.2 million cameras are installed across cities in the UK; it is estimated that one Briton is captured by cameras roughly 300 times in a single day. Thus, the major research in this study addresses the development of new technologies for video surveillance.

In conventional blob-based object detection systems, some typical methods are applied to extract the moving objects, e.g., background subtraction [6], optical flow [7, 20-21], frame difference analysis [8], and the codebook model [9]. In [6], the regions of moving objects may be acquired precisely by background subtraction, but the method is extremely sensitive to illumination variation and dynamic background changes. In [7, 20-21], the optical flow method is used to independently track each low-level object feature. The frame differencing method [8] adapts to dynamic illumination changes, but the regions of the moving objects are extracted incompletely when the background varies. In [9], long sequences of background values are collected to construct a pixel-based codebook via vector quantization, which yields a compressed representation of the background model; it can overcome moving backgrounds and illumination variations, but it has difficulty detecting and tracking targets in crowded scenes.

Conventional target tracking systems [10, 11] often focus on target tracking in less crowded scenes. For target tracking, the method proposed by Zhao and Nevatia [12, 13] is one of the well-known algorithms: they use a color histogram to model each object's appearance, articulated ellipsoids to model the human shape, and an augmented Gaussian distribution to model the background for segmentation. However, their method is not suitable for tracking in crowded scenes, because only a partial region of a tracked target can be seen and the ellipsoid model can then fail to fit the human body shape. Liangfu et al. [14] proposed a method for tracking objects with large motion areas based on a modification of the mean-shift algorithm: an efficient similarity-measure function searches for the rough location of a moving object, and the mean-shift method then obtains the local optimum by iterative computing.

In modern video surveillance applications, one of the most challenging problems is target tracking in the crowded scenes shown in Fig. 1. Generally, serious occlusions make conventional blob-based target detection/tracking methods fail. For example, when tracking a target in a crowded rail station or square, the tracked person can be heavily occluded by other persons, and only part of the region can serve as a cue for continuous tracking. Hence, tracking an individual target in a dense crowd faces two major problems: 1) the target size can be small when we monitor a large space where the crowd moves; 2) frequent partial occlusions make target segmentation very difficult. Conventional blob-based methods may thus fail to track an individual object continuously in crowded scenes. Therefore, point-based features are applied to tackle the problem of tracking in crowds: even in crowded scenes, there may be some point features on the partial target regions that are not occluded. Fig. 2-(b) illustrates an example of object extraction using the blob-based method; within the foreground region formed by merging several targets, it is obviously difficult to segment each target. On the contrary, Fig. 2-(c) shows the feasibility of applying the point-based method to detect and track each individual target in the crowded scene.

Fig. 1 Top-down views of crowded scenes. (a) Crowded scene in a tunnel. (b) Crowded scene at Chung-Hua University. (c) Occlusion situation in the scene.

Fig. 2 Detecting and segmenting targets in a crowded scene. (a) Original image. (b) The blob-based background subtraction method. (c) Target detection and tracking using the point-based feature; each color represents a different individual target.

Cheriyadat and Radke [15] recently proposed a related technique for detecting and classifying the primary crowd moving trajectories with the methods of optical flow and longest common subsequences; the moving trajectories can further be clustered into smooth dominant motions. However, individual segmentation and tracking are not considered in their system. With the same concept of feature point tracking, Brostow and Cipolla [16] proposed a Bayesian clustering algorithm that can segment each individual entity within the crowd using space-time proximity and trajectory coherence.

In order to detect each individual target in crowded scenes and analyze the crowd moving trajectories, we propose two methods to detect and track the individual targets in the crowd and to classify the crowd motion trajectories. The system flowchart is shown in Fig. 3. The two main parts of the proposed system are described in the sequel.

1) Classification of crowd trajectories: In the left part of Fig. 3, each corner feature point is tracked continuously until it vanishes. We apply the LCSS algorithm [15] to compare all feature tracks with the predefined tracks and classify all point tracks.

2) Tracking in crowd: In the right part of Fig. 3, we propose a coarse-to-fine approach based on corner point extraction and tracking, in which the C-means algorithm extracts the coarse clusters and the construction of a spatial-temporal shortest spanning tree is proposed to segment each individual subject. Finally, each target is tracked by the concept of point inheritance.

The thesis is organized as follows. Chapter 2 describes the methods of extracting the low-level features (corner points) and tracking the corner points between successive frames. Chapters 3 and 4 describe the frameworks of the two systems, tracking in crowd and classification of crowd trajectories, respectively. Chapter 5 presents the experimental results for both systems. Finally, Chapter 6 concludes the study.


[System flowchart figure. Common stages: Video Sequence → Feature Points Detection → Feature Points Tracking → Displacement Length Filter (static feature points under the threshold are discarded; dynamic feature points over the threshold are kept). Left branch: all tracks are compared with the preset tracks by the LCSS algorithm and the output is displayed according to the trajectory type. Right branch: C-means clustering → construct a spanning tree in all clusters → individual segmentation with the spatial-temporal relationship → tracking result trajectories / cluster result.]

Fig. 3 System flowchart. The left side describes the method of classification of crowd trajectories; the right side describes the method of tracking in crowd.


Chapter 2. Feature Points Extraction and Tracking in Video Sequence

The goal of feature extraction is to seek reliable image content that is robust to image translation, rotation, and scaling. The robustness of the extracted features influences the coherence of the point-based tracking algorithm. Therefore, in this chapter we describe which image features we extract and how they work in the point-based tracking scheme.

2.1 Feature Extraction

Generally, traditional blob-based object detection methods [17] can extract moving objects when the moving targets are not crowded. However, when many objects move close together, traditional blob-based methods cannot segment the individual objects. Here, we apply the point-based object tracking method to overcome the problem of tracking in crowds.

Three typical point-based features were considered for our system: SIFT [20], SURF [21], and KLT [7]. Basically, SIFT and SURF are the same kind of feature: they are invariant to image scaling and rotation, and robust for image matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. Each feature can be correctly matched with high probability against a large database of features from many images. However, SIFT requires a high computation cost in the image matching process, so it is difficult to establish a real-time video surveillance system with the SIFT feature. Hence, we apply the corner feature to develop our real-time system; the corresponding corner features in the next frame are located by the feature tracking method in [18]. Fig. 4 shows an example of corner feature detection.

Fig. 4 Corner feature points. (a) Original image. (b) Green points represent the extracted corner feature points.
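For illustration, a minimal Python/OpenCV sketch of this corner extraction step is given below. The file names and parameter values are illustrative assumptions, not those used in the thesis.

```python
import cv2

# Load one frame and convert it to grayscale (the file name is a placeholder).
frame = cv2.imread("frame0.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Shi-Tomasi "good features to track" [7]: keep corners whose minimum
# eigenvalue of the gradient covariance matrix passes a quality threshold.
corners = cv2.goodFeaturesToTrack(gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)

# Draw each detected corner as a green dot, as in Fig. 4-(b).
for x, y in corners.reshape(-1, 2):
    cv2.circle(frame, (int(x), int(y)), 2, (0, 255, 0), -1)
cv2.imwrite("corners.png", frame)
```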

2.2 Corner Features

In [19], Kanade et al. presented an image registration technique that uses the spatial intensity gradient to find precise matches. The authors in [7, 18] later modified this method into the Kanade-Lucas-Tomasi (KLT) feature tracking algorithm for extracting and tracking feature points in video sequences. In the KLT algorithm, a covariance matrix composed of the intensity gradients of the designated area in the x-direction and y-direction is established for each window. Through eigen-analysis of the covariance matrix, we can identify whether a feature point exists in the search window. The covariance matrix Cw in a w × w window centered at the pixel p is defined as:

$$C_w = \begin{bmatrix} S_\sigma(I_x^2) & S_\sigma(I_x I_y) \\ S_\sigma(I_x I_y) & S_\sigma(I_y^2) \end{bmatrix}, \qquad (1)$$

where $S_\sigma(\cdot)$ is a low-pass Gaussian filter with standard deviation σ. After acquiring the eigenvalues e1 and e2 of the covariance matrix Cw, there are three possibilities when comparing them to the predefined threshold Teigen:

1. If eigenvalues e1 and e2 are both lower than Teigen, the intensities in the window are alike.

2. If one of them is larger than Teigen, the pixel p is a unidirectional pattern.

3. If eigenvalue Emin = min (e1, e2) is larger than the threshold Teigen, the pixel p is a corner feature [18].

Therefore, we can identify the existence of corner feature by the above rules.
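The rules above can be sketched directly with NumPy/SciPy as follows; the smoothing scale and the relative threshold are illustrative assumptions rather than the thesis settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def corner_mask(gray, sigma=1.5, rel_thresh=0.01):
    """Flag pixels whose minimum eigenvalue of the smoothed gradient
    covariance matrix Cw (Eq. 1) is large, i.e., corner features."""
    ix = sobel(gray.astype(float), axis=1)   # intensity gradient I_x
    iy = sobel(gray.astype(float), axis=0)   # intensity gradient I_y
    # Entries of Cw: Gaussian-weighted (S_sigma) products of gradients.
    sxx = gaussian_filter(ix * ix, sigma)
    syy = gaussian_filter(iy * iy, sigma)
    sxy = gaussian_filter(ix * iy, sigma)
    # Closed-form minimum eigenvalue of the symmetric 2x2 matrix.
    e_min = 0.5 * ((sxx + syy) - np.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2))
    # T_eigen is taken relative to the strongest response (an assumption).
    return e_min > rel_thresh * e_min.max()
```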

2.3 Feature Tracking by KLT Tracker

In this section, we briefly describe the KLT point tracking algorithm in [18]. Assume there are two successive images I and J: image I contains a feature point I(x, y, t) to be tracked, and image J represents the successive frame. Suppose the feature point moves with a displacement vector d = (ξ, η) over a time period τ; then the shifted feature point at time t + τ can be expressed as:

I(x, y, t + τ) = I(x – ξ, y – η, t). (2)

If we redefine J(x) = I(x, y, t + τ) and I(x - d) = I(x – ξ, y – η, t), we can obtain

J(x) = I(x - d) + n(x), (3)

where n denotes noise. Minimizing the residual error ε over the window W, defined in Eq. (4), yields a closed-form solution for the displacement vector d, shown in Eq. (5):

$$\varepsilon = \int_W \big[\, I(x - d) - J(x) \,\big]^2 \, \omega \, dx, \qquad (4)$$

where ω is a weighting factor.

$$G d = e, \qquad (5)$$

where $G = \int_W g\, g^T\, \omega\, dA$ and $e = \int_W (I - J)\, g\, \omega\, dA$.

Finally, we use SVD (singular value decomposition) to obtain the closed form solution of d = (dx, dy) in Eq. (5).
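As a concrete sketch, the pyramidal KLT tracker in OpenCV can carry the corner points from frame I to frame J; the file names and window size below are illustrative assumptions.

```python
import cv2

# Two successive grayscale frames I and J (file names are placeholders).
prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Corner features to track, as in section 2.2.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500,
                             qualityLevel=0.01, minDistance=5)

# Solve Gd = e (Eq. 5) per point over a 15x15 window; image pyramids
# handle displacements larger than the window.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(15, 15))

# Keep only the points that were tracked successfully.
tracked = p1[status.ravel() == 1]
print(f"tracked {len(tracked)} of {len(p0)} points")
```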


Chapter 3. Tracking in Crowd

In general, conventional object detection in crowds faces several problems in the individual segmentation process. First, it is difficult to find accurate object boundaries with background subtraction methods [6] in crowded scenes. Second, supervised learning or any subject-specific model [9] requires considerable computation to train. Third, moving subjects with different moving directions may merge together. To tackle these problems, we propose a coarse-to-fine approach based on the corner point extraction and tracking processes, in which the C-means algorithm extracts the coarse clusters and the spatial-temporal shortest spanning tree is proposed to segment each individual subject. The system block diagram of tracking in crowd is shown in Fig. 5 and described as follows.


[System block diagram figure. KLT point tracking: Video Sequence → Feature Points Detection → Trajectory Length Filtering (static points < LT discarded; dynamic points > LT kept) → Feature Points Tracking → Record Feature Points Trajectories. Points type filtering: C-means clustering → construct a spatial-temporal tree for each cluster → delete low-correlation points in each cluster → individual segmentation with the spatial-temporal relationship → object labeling. Points correlation filtering: if not the initial frame, inherit cluster information (members, center, and ID) from frame T-1; add new feature points into an existing cluster when they are close and under the distance threshold; integrate new and old cluster information; if a cluster boundary exceeds the target-width threshold Twidth, separate the cluster and label the objects; cluster the remaining new feature points by C-Means excluding existing clusters; execute the points-correlation filtering process.]

Fig. 5 System block diagram for tracking in crowd.

3.1 Point-Based Tracking

In our system framework, we first detect the low-level feature points with the Shi-Tomasi-Kanade detector [7]. Once the corner points are extracted, each feature point is tracked by the Kanade-Lucas-Tomasi optical flow [18] between two successive frames. Each trace of a tracked corner point can be represented as:

$$\{\{(x_t^i, y_t^i),\ t = T_{init}^i, \ldots, T_{final}^i\},\ i = 1, \ldots, N\}, \qquad (6)$$

where $(x_t^i, y_t^i)$ denotes the image coordinate of the corner point i at frame t, N denotes the total number of feature point tracks, and T denotes the frame number.

Here, our goal is to cluster the feature points on each individual object in the crowd by their close spatial-temporal relationship. Before tracking the corner points on each object, static feature points located on the background region are removed using motion information. Fig. 6-(a) shows the low-level feature points extracted on a frame of the campus video sequence, and Fig. 6-(b) shows that the static feature points located on the background region have been removed.

Fig. 6 Feature point detection. (a) The extracted corner points on a frame of the campus video sequence. (b) The static corner points are filtered out with motion information.

3.2 Rough Segmentation with C-Means Algorithm

By careful observation, the corner points on each individual object have high spatial-temporal correlation: the spatial correlation measures the geometric relationship among the corner points belonging to an object, while the temporal correlation measures the trajectory consistency of those corner points. Hence, the corner points that belong to one individual object should have high spatial-temporal correlation measures. Here, the C-means algorithm is used to roughly cluster the dynamic feature points based on the spatial correlation measure. The C-means algorithm for coarse segmentation of the corner points is described as follows:

1. The dynamic feature points are first recorded in the set S = {p1, p2, …, pN}, and the track information of each dynamic point is also recorded. Let Ci denote the set containing the member points and the cluster center, where i is the number of clusters. Initially the number of clusters i is equal to 1 and Ci = {p1}.

2. Select a dynamic point from S in order and calculate its distance to each existing cluster center Ci. The distance is calculated according to Eq. (7), which puts different weights on the horizontal and vertical directions (based on the proportions of the human shape):

$$Distance = \alpha_x (p_x^i - C_x^j)^2 + \alpha_y (p_y^i - C_y^j)^2, \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, N, \qquad (7)$$

where αx and αy are the weights on the x-axis and y-axis, respectively. The weight setting is based on the proportions of the human body; typically αx = 1.2~1.6 and αy = 0.5~0.8.

3. If the distance from pi to every existing cluster center exceeds a threshold Td, a new cluster Cn+1 = {pi} is created. Otherwise, the dynamic point pi is assigned to the closest existing cluster C*.

4. Repeat steps 2~3 until all dynamic points in the set S are processed.

Fig. 7 shows the result of rough segmentation with the C-Means algorithm; each roughly segmented cluster is denoted by a different color. It is obvious that the crowded region needs to be segmented in more detail.

Fig. 7 The result of rough segmentation with the C-Means algorithm.
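A minimal sketch of this incremental clustering step follows; the weights use the illustrative ranges quoted above, and the threshold value and helper function are assumptions, not the thesis implementation.

```python
import numpy as np

def rough_cluster(points, alpha_x=1.4, alpha_y=0.6, t_d=900.0):
    """Incrementally assign 2D dynamic points to clusters using the
    weighted squared distance of Eq. (7); t_d is an assumed threshold."""
    centers, members = [], []
    for p in points:
        if centers:
            d = [alpha_x * (p[0] - c[0]) ** 2 + alpha_y * (p[1] - c[1]) ** 2
                 for c in centers]
            j = int(np.argmin(d))
        if not centers or d[j] > t_d:
            centers.append(np.array(p, dtype=float))   # create a new cluster
            members.append([p])
        else:
            members[j].append(p)                        # join the closest cluster
            centers[j] = np.mean(members[j], axis=0)    # update its center
    return centers, members

centers, members = rough_cluster([(10, 12), (11, 14), (80, 90), (82, 88)])
print(len(centers), "clusters")                          # expected: 2
```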


3.3 Individual Segmentation with Spatial-Temporal Shortest Spanning Tree

Based on the rough segmentation from the C-means algorithm, the corner points located on a subject should be close in both their geometric distribution and their trajectories. Hence, a cluster-refining process implemented with the shortest spanning tree is applied to the roughly segmented clusters to obtain the individual objects. The algorithm of individual segmentation with the shortest spanning tree is described as follows.

1. Build a shortest spanning tree in each cluster. Let the set SST = {p1, p2, …, pN} represent the point set within a cluster, where N denotes the total number of cluster members.

2. Calculate the weighted distance according to the following formula:

$$Weight = \beta_x (p_x^i - p_x^j)^2 + \beta_y (p_y^i - p_y^j)^2, \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, N, \qquad (8)$$

where βx and βy are the weights on the x and y directions, respectively.

3. According to the distance calculation in step 2, construct the shortest spanning tree sequentially.

4. Retain the weights table and the order of SST for constructing the spatial-temporal shortest spanning tree.

The construction process is illustrated in Fig. 8, and Fig. 9 shows examples of individual segmentation with the spatial-temporal shortest spanning tree. In the following, we modify this algorithm by incorporating the trajectory correlation to segment individuals more precisely.
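As an illustration, the sketch below builds the minimum spanning tree of one cluster under the weighted distance of Eq. (8) using SciPy; the weights are assumptions carried over from the C-Means step.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def cluster_spanning_tree(pts, beta_x=1.4, beta_y=0.6):
    """Return the shortest spanning tree (as a sparse matrix) over a
    cluster's points, with edge weights from Eq. (8)."""
    pts = np.asarray(pts, dtype=float)
    dx = pts[:, 0:1] - pts[:, 0:1].T         # pairwise x differences
    dy = pts[:, 1:2] - pts[:, 1:2].T         # pairwise y differences
    w = beta_x * dx ** 2 + beta_y * dy ** 2  # dense weight matrix of Eq. (8)
    return minimum_spanning_tree(w)

tree = cluster_spanning_tree([(0, 0), (2, 1), (10, 10), (11, 9)])
print(tree.toarray())  # nonzero entries are the retained tree edges
```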


Fig. 8 The construction of the shortest spanning tree in a cluster.

Fig. 9 The usage of the shortest spanning tree. (a) Multiple targets walking closely. (b)

Given two tracks Xu and Xv generated from two feature points, if the track variance defined in Eq. (9) for the two tracks is small, then the two feature points are likely to belong to the same target.

$$Correlation(X_u, X_v) = \frac{1}{1 + Variance(X_u, X_v)}, \qquad (9)$$

where Variance(Xu ,Xv) = Variance(DistanceEucl(Xu ,Xv)) within N frames. Finally, we can establish a spatial-temporal conformance measure as:

$$Conformance(p_i, p_j) = \frac{\beta_x (p_x^i - p_x^j)^2 + \beta_y (p_y^i - p_y^j)^2}{Correlation(X_i, X_j)}, \quad i = 1, 2, \ldots, N, \; j = i + 1, \qquad (10)$$

If the conformance value is larger than a predefined threshold, the two points do not belong to the same subject. However, a feature point belonging to subject A may be mis-located on subject B when two subjects cross. Hence, in the next section we develop a voting method to reconfirm the point attribute.
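A sketch of the correlation and conformance measures of Eqs. (9) and (10) follows. The weights are illustrative, the ratio form of Eq. (10) follows the reconstruction above, and the two tracks are assumed to cover the same N frames.

```python
import numpy as np

def conformance(p_i, p_j, track_i, track_j, beta_x=1.4, beta_y=0.6):
    """Spatial-temporal conformance of Eq. (10): weighted spatial distance
    between two points divided by the trajectory correlation of Eq. (9)."""
    # Frame-by-frame Euclidean distance between the two tracks.
    d = np.linalg.norm(np.asarray(track_i) - np.asarray(track_j), axis=1)
    correlation = 1.0 / (1.0 + np.var(d))                       # Eq. (9)
    spatial = beta_x * (p_i[0] - p_j[0]) ** 2 + beta_y * (p_i[1] - p_j[1]) ** 2
    return spatial / correlation
```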

3.4 Voting Method for Trajectory Conformance

Although the point-based method can overcome some target tracking problems in the crowded scene, exceptional situations still exist in point tracking; e.g., a feature point belonging to subject A may be mis-located on subject B when two subjects cross. Fig. 10 illustrates that many points may be mis-located when two persons cross each other.


Fig. 10 Many points may be mis-located when two persons cross each other.

To overcome the above-mentioned problem, we apply the voting strategy with temporal information recording to ensure the trajectory conformance of tracked feature points in a cluster. The voting rule for trajectory conformance is described as follows.

1. The vector $\vec{D}_{(x,y)}(F_t, F_{t-10})$, acquired over 10 successive frames, is defined to record the moving direction of each feature point.

2. Based on the signs of $\vec{D}_{(x,y)}$, there are four possible sign pairs of $D_x$ and $D_y$ denoting the moving direction, as shown in Fig. 11: (+, +), (+, -), (-, +), and (-, -). We store the direction at each frame for all dynamic points that belong to each subject.

3. In the recorded direction list, one direction receives the highest number of votes. We assume it is the dominant motion direction of the cluster and delete the dynamic points with different directions.

4. Repeat steps 1~3 until all clusters are processed.

5. Once a cluster's direction list is modified, the cluster center must be recalculated.

Fig. 11 The possible combinations of DX and DY.
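A small sketch of this direction-voting filter is given below, under the assumption that every trajectory in a cluster stores at least 11 positions.

```python
import numpy as np
from collections import Counter

def dominant_direction_filter(tracks):
    """Keep only the tracks whose sign pair (Dx, Dy) over the last 10
    frames matches the cluster's majority direction (section 3.4)."""
    def sign_pair(tr):
        d = np.asarray(tr[-1]) - np.asarray(tr[-11])  # displacement over 10 frames
        return (d[0] >= 0, d[1] >= 0)                 # quantized moving direction
    votes = Counter(sign_pair(tr) for tr in tracks)
    majority = votes.most_common(1)[0][0]
    return [tr for tr in tracks if sign_pair(tr) == majority]
```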

3.5 Object Tracking

In order to track all targets accurately, a stable and reliable approach is necessary. According to the feature tracking mechanism in chapter 2, a corner feature can be tracked efficiently between two successive frames. Therefore, the KLT tracking algorithm is adopted to track the existing dynamic feature points, with each point inheriting its attributes from the last frame. On the other hand, new points can appear when the tracking of existing points fails or the points vanish; hence, it is important to link the tracking relation between old and new points. The complete algorithm for point-based tracking in crowd is described in Fig. 12.


[Flowchart figure: inherit cluster information (members, center, and ID) from frame T-1 → add new feature points into an existing cluster when they are close and under the distance threshold → integrate new and old cluster information → if the cluster boundary exceeds the target-width threshold Twidth, separate the cluster and label the objects → cluster the remaining new feature points by C-Means excluding existing clusters → execute the points-correlation filtering process.]

Fig. 12 The flowchart of the object tracking system.

Based on the property of corner point tracking described in chapter 2, corner points that are tracked successfully and continuously are likely located on the same moving target. Therefore, inheriting point attributes within each cluster becomes very important. The point-based object tracking algorithm is described as follows.

Object Tracking Algorithm

1. If the corner points are tracked successfully and continuously, they inherit the attribute of the segmented individual cluster. Otherwise, if the corner points are newly generated in the ROI, we classify them into the set PNEW = {p1, p2, …, pN}, and the distance between each new point and the existing cluster centers Cj is calculated as:

$$Distance = \alpha_x (p_x^i - C_x^j)^2 + \alpha_y (p_y^i - C_y^j)^2, \qquad (11)$$

where αx and αy are the weights on the x-axis and y-axis, as in the C-Means clustering process. Then we add the new point to the closest cluster. Fig. 13 shows the inherited points and the new points separately.

2. After adding the new points to the nearest cluster, the cluster centers may change. In addition, multiple objects may move together, in which case the individual segmentation must be performed again. The individual segmentation mechanism is based on the normal width of a human body. First we calculate the width of the cluster as:

$$ClusterWidth = RightBoundary - LeftBoundary. \qquad (12)$$

If the width of the cluster is larger than the predefined threshold TWIDTH (estimated from the normal width of a human body), the cluster must be segmented, and each newly segmented cluster is given a new ID. Fig. 14 illustrates the concept of re-segmentation.

3. If some new points are not classified into any existing cluster, these points are classified into the new set P'NEW = {p'1, p'2, …, p'N} and regarded as new moving objects; Fig. 15 illustrates the existing clusters (existing objects) and the new clusters (new objects). We use the C-Means algorithm to cluster them without comparing them to the existing clusters.

4. Integrate the new and old clusters’ information.

5. Check the consistency of all points in the same cluster and execute the individual segmentation process with the spatial-temporal shortest spanning tree described in sections 3.3 and 3.4 for every cluster. Fig. 16 shows the final results.
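As a small illustration of step 2, the width test of Eq. (12) might look like the following; the threshold value is an assumption standing in for the human-body width.

```python
def needs_resegmentation(cluster_points, t_width=60):
    """Check Eq. (12): a cluster whose horizontal extent exceeds the
    assumed human-width threshold t_width (pixels) must be re-segmented."""
    xs = [p[0] for p in cluster_points]
    return (max(xs) - min(xs)) > t_width
```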

Fig. 13 The inherited points (red) from the last frame and the newly created points (blue) in the current frame.

Fig. 14 Segmentation of a merged cluster. (Red line) Cluster width. (Blue line) Normal width of a human body.

Fig. 15 Grouping new points. (a) Green points are independent new points distant from the existing red clusters. (b) The result of clustering.

Fig. 16 Final result of segmenting the tracked objects.


Chapter 4. Classification of Crowd Trajectories

In this study, we adopt the method of the longest common subsequences [15] to automatically analyze the crowd motion trajectories. Fig. 17 shows the flowchart of the LCSS algorithm. The proposed system consists of a feature-extraction preprocess and trajectory classification using the longest-common-subsequences algorithm.

[Flowchart figure. KLT point tracking: Video Sequence → Feature Points Detection → Trajectory Length Filtering (static points < LT discarded; dynamic points > LT kept) → Feature Points Tracking → Record Feature Points Trajectories. All tracks are then compared with the preset tracks by the LCSS algorithm and, according to the matched trajectory type (types 1 to N), displayed as different output.]

Fig. 17 The flowchart of the LCSS algorithm used to classify the crowd motion trajectories.

4.1 Preprocess

Here, we apply the point-based object tracking method to collect the moving trajectories of all feature points on the moving objects and use the LCSS method to classify the moving trajectories. The input to our system is a set of feature point tracks represented as:

$$\{\{(x_t^i, y_t^i),\ t = T_{init}^i, \ldots, T_{final}^i\},\ i = 1, \ldots, N\}, \qquad (13)$$

where $(x_t^i, y_t^i)$ denotes the image coordinates of the corner point i at frame t, N denotes the total number of feature point tracks, and T denotes the frame number. The points are tracked continuously until they vanish. Fig. 18 illustrates the extracted point tracks, which are based on the feature point tracking scheme described in chapters 2 and 3.

Fig. 18 (a) The extraction of corner points. (b) The tracks of the corner points.


Before classifying the crowd moving trajectories, static points and noisy tracks need to be removed to improve the classification efficiency and accuracy, as shown in Fig. 19. In addition, a region of interest (ROI), shown in Fig. 19, is defined so that feature points are detected in the major area before the feature tracks are classified.

Fig. 19 (a) Detection of the feature points in the predefined ROI. (b) Filtering of the static points.


4.2 Longest Common Subsequences Algorithm

The aim in this section is to classify feature point tracks that are spatially close. Thus, we require a distance metric based on the longest common subsequence for comparing point tracks. Let A and A' denote two feature point tracks defined in Equations (14) and (15):

$$A_i = \{(x_t^i, y_t^i),\ t = 1, \ldots, N\},\ i = 1, \ldots, 6, \qquad (14)$$

$$A' = \{(x'_t, y'_t),\ t = 1, \ldots, N'\}. \qquad (15)$$

We predefine six categories of trajectories A1~A6, represented by three colors and different directions. Each point track is compared with the predefined trajectories. Fig. 20 shows the predefined trajectories A1~A6.

Fig. 20 The predefined trajectories A1 ~A6.

Here, the distance metric for two point tracks is measured by the matching cost function M (A, A’) defined in Eq. (16).

$$M(A_n, A'_{n'}) = \begin{cases} 0, & \text{if } A \text{ or } A' \text{ is empty,} \\ 1 + M(A_{n-1}, A'_{n'-1}), & \text{if } \|A(n) - A'(n')\| < \varepsilon \text{ and } |n - n'| < \delta, \\ \max\big(M(A_{n-1}, A'_{n'}),\ M(A_n, A'_{n'-1})\big), & \text{otherwise,} \end{cases} \qquad (16)$$

where the variable δ controls the flexibility of matching tracks in time and the variable ε controls the spatial matching threshold. The value represents the matching cost between two point tracks. We define a two-dimensional array Q with N rows and N' columns, which is populated from the upper-left element to the lower-right element as illustrated in Fig. 21. The longest common subsequence between the two tracks is implicitly represented by the number of corresponding points obtained by computing the distance metric M. We define the set of point pairs as I:

$$I = \{(I_A^i, I_{A'}^i),\ i = 1, \ldots, L\}, \qquad (17)$$

where IA and IA’ are represented the matching indices for tracks A and A’

corresponding to the matching cost M, and L represents the amount of matching points. The longest common subsequence can be acquired by the backtracking algorithm in [15]. In addition, a matching ratio R for every pair of point tracks is defined to measure point matching degree, which is shown in Eq. (18).

R = L / N . (18)

The spatial similarity between two feature point tracks is evaluated as:

$$D_{spt}(A, A') = \max\big\{\, \|A(I_A^i) - A'(I_{A'}^i)\|_2,\ i = 1, \ldots, L \,\big\}. \qquad (19)$$

Finally, we combine the matching ratio R and the spatial similarity Dspt to distinguish which pre-defined track is matched.


Fig. 21 Example of the matching cost M. (a) Two tracks compared in terms of matching cost. (b) The 2D array for computing the matching cost.
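A compact dynamic-programming sketch of the matching cost M of Eq. (16) and the matching ratio R of Eq. (18) follows; the thresholds ε and δ are illustrative values.

```python
import numpy as np

def lcss_cost(a, b, eps=10.0, delta=15):
    """Fill the N x N' table of Eq. (16); a and b are lists of (x, y)
    points, eps is the spatial threshold, delta the temporal slack."""
    n, m = len(a), len(b)
    q = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = (abs(a[i-1][0] - b[j-1][0]) < eps and
                     abs(a[i-1][1] - b[j-1][1]) < eps and
                     abs(i - j) < delta)
            q[i, j] = 1 + q[i-1, j-1] if close else max(q[i-1, j], q[i, j-1])
    return q[n, m]

track = [(t, 2 * t) for t in range(20)]
preset = [(t + 1, 2 * t) for t in range(20)]
r = lcss_cost(track, preset) / len(track)   # matching ratio R of Eq. (18)
print(r)
```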


Chapter 5. Experiment Results

In this chapter, some important experimental results for the proposed systems on the crowded scene are shown. The experiments consist of two parts:

1) Tracking in crowd,

2) Classification of crowd motion trajectories.

5.1 Tracking in Crowd

5.1.1 Test Videos

Two test videos, "Commons01" and "Tunnel-A125", are used to test the performance of individual segmentation and object tracking. The video "Commons01" is obtained from the web site in [15]; it is captured by a camera above a building gate with a camera tilt angle of 40°. This sequence contains 678 frames and is about 22 seconds long. The other test video, "Tunnel-A125", was provided by professor Brostow [16] and is captured in a tunnel where people walk side by side. This sequence contains 3307 frames and is about 110 seconds long.

5.1.2 Corner Points Detection on the Moving Object

Here, we detect the corner feature points in the initial frame and track these points continuously until the tracking fails or the points vanish from the scene. When old points (red) vanish, the system adds new points (green) at locations that are corner points. The moving objects are then detected by filtering the track length. Fig. 22 and Fig. 23 illustrate the detected corner points on the moving objects in successive frames of the two test video sequences.

[Figure: ten frames, #466 to #511, at five-frame intervals.]

Fig. 22 Moving object detection by filtering the track lengths in the outdoor scene.

(48)

frame #307 frame #312

frame #317 frame #322

frame #327 frame #332

frame #337 frame #342

frame #347 frame #352

Fig. 23 Moving object detection by filtering the tracks length in the indoor scene.


5.1.3 Construction of Spatial-Temporal Shortest Spanning Tree

The dynamic points are coarsely classified by the C-Means algorithm, and the shortest spanning tree is constructed in each cluster during preprocessing. In Fig. 24, each color represents a different cluster, and all points in a cluster are linked by lines. At this stage the result has not yet been refined by any method, so some irregularities appear.

[Figure: frames #510-#525 (a) and #689-#704 (b).]

Fig. 24 Construction of the shortest spanning tree in each cluster. (a) A simple case in the outdoor scene. (b) People walking arm-in-arm in the indoor scene.

5.1.4 Segmentation with the Spatial-Temporal Shortest Spanning Tree

Based on the method proposed in chapter 3, we can partition each individual target in the crowd. Fig. 25 shows the experimental result of individual segmentation with the spatial-temporal shortest spanning tree.

[Figure: frames #510-#525 (a) and #689-#704 (b).]

Fig. 25 Segmenting the objects with the spatial-temporal shortest spanning tree. (a) Individual segmentation in the outdoor scene. (b) Individual segmentation in the indoor scene.

5.1.5 Object Tracking

Object tracking is based on the concept of feature point inheritance described in chapter 3. The proposed point-based object tracking algorithm makes target tracking more accurate and efficient. Fig. 26 illustrates the experimental results: three people walk closely together and the occlusion problem becomes serious, yet the proposed method can segment each individual person even when occlusion exists.

[Figure: frames #247, #267, #287, #307, #327, #347, and #637.]

Fig. 26 Segmentation and tracking of the targets in the crowd.

5.1.6 Accuracy Analyses

In order to evaluate the accuracy of the proposed system, we compare our accuracy analysis with the methods in [12, 16]. Our method outperforms the others in detection rate and miss detection rate, and its false detection rate is close to the other methods.

Table 1 The accuracy analysis for the methods of Brostow & Cipolla, Zhao & Nevatia, and ours.

                        Brostow & Cipolla   Zhao & Nevatia   Ours
distinct detections     144                 8466             1319
correctly detected      136                 7881             1254
missed detections       8                   585              65
false detections        33                  291              56
detection rate          94%                 93.09%           95.07%
miss detection rate     5.6%                6.91%            4.92%
false detection rate    22.9%               3.43%            4.25%

In the accuracy analysis, we extract 400 successive frames from the video "Tunnel-A125". The efficiency of our system can approach 6 to 8 fps, and it can be further improved by adjusting the number of feature points. The Bayesian detection method [16] selected its test frames from "Tunnel-A125" randomly. Because it must compute the likelihoods between many image features, the computational complexity of the method in [16] is higher: it takes about 5 seconds to perform the spatial clustering process for each frame. In our experiment, we analyze and count the detection rate in the area between the blue lines that we define. Some results of individual segmentation are shown in Fig. 27.

[Figure: frames #2648, #2679, #2762, #2820, #2833, #2880, #2981, and #3048.]

Fig. 27 Analysis of the detection rate in the area between the blue lines.


5.2 Classification of Crowd Trajectories

5.2.1 Test Videos

The video sequence used to test the classification of crowd motion trajectories is captured from an open space at Chung-Hua University. The sequence contains 1128 frames and is about 2 minutes long.

5.2.2 Classification of the Points’ Trajectories

The pavement area at Chung-Hua University serves as the test environment, where people move frequently. Fig. 28 shows the result of the experiment. In this experiment, we extract feature point trajectories from sparse and dense crowd scenes respectively; the trajectory classification focuses not only on human tracking but also on traffic monitoring such as vehicle detection (see Fig. 28 (c)-(d)). The LCSS algorithm analyzes each trajectory and determines which predefined trajectory it matches. The direction of a feature's motion is represented by arrowed lines in different colors. For example, in Fig. 28-(b), even when people move in the same direction, our system can still distinguish two different predefined trajectories.

Fig. 28 Classification of the point trajectories. (a) A simple case of a few people walking separately. (b) Crowds walking closely. (c), (d) Classification of the trajectories on cars.


Chapter 6. Conclusion

In this thesis, we propose two intelligent analysis systems based on a feature tracking scheme for crowded scenes. In the tracking-in-crowd subsystem, the coarse-to-fine approach refines the rough clustering results into more complete targets. The spatial-temporal relationship and the inheritance-based target tracking scheme alleviate two problems: first, targets cannot be detected accurately in crowded scenes where people walk closely together; second, tracking a single target becomes difficult under serious occlusion. The experimental results show that the accuracy of individual segmentation in the crowd can be higher than 90%, and our system can process about 6 to 8 frames per second. In the classification-of-crowd-trajectories subsystem, we use the longest common subsequences matching approach [15] in experiments on our campus. The generated motion directions are useful representations of long-term crowd activity, and the system is both practical and promising. Future work on the tracking-in-crowd subsystem will focus on refining the accuracy of target segmentation and tracking and on making the system flexible enough for more scenes.


Chapter 7. References

[1] M. Petkovic and W. Jonker, “Content-Based Video Retrieval by Integrating Spatio-Temporal and Stochastic Recognition of Events,” IEEE Workshop on Detection and Recognition of Events in Video, July 2001, pp. 75-82.

[2] J. Assfalg, A. D. Bimbo, and M. Hirakawa, “A Mosaic-Based Query Language for Video Databases,” IEEE International Symposium on Visual Language, 2000, pp. 31-38.

[3] S. Dagtas, W. Al-Khatib, A. Ghafoor, and R. L. Kashyap, “Models for Motion-Based Video Indexing and Retrieval,” IEEE Transactions on Image Processing, Vol. 9, No. 1, Jan. 2000, pp. 88-101.

[4] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: Who? When? Where? What? A Real-time System for Detecting and Tracking People,” IEEE International Conference on Automatic Face and Gesture Recognition, April 1998, pp. 222-227.

[5] J. Rosen, “A Cautionary Tale for A New Age of Surveillance,” The New York Times Magazine, Oct. 2001.

[6] C. R. Jung, “Efficient Background Subtraction and Shadow Removal for Monochromatic Video Sequences,” IEEE Transactions on Multimedia, Vol. 11, No. 3, April 2009, pp. 571-577.

[7] J. Shi and C. Tomasi, “Good Features to Track,” Proceedings of the IEEE Conference Computer Vision and Pattern Recognition, June 1994, pp. 593-600.

[8] S. Wang, X. Wang, and H. Chen, “A Stereo Video Segmentation Algorithm Combining Disparity Map and Frame Difference,” International Conference on Intelligent System and Knowledge Engineering, Vol. 1, Nov. 2008, pp.
