中華大學

(1)

中華大學碩士論文

題目: 於壓縮域影片進行人類行為分析 Human Action Analysis Using Motion History

Polar Histogram

系所別：資訊工程學系碩士班學號姓名：M09302010 傅郁婷指導教授：連振昌博士

中華民國九十七年八月

(2)

摘要

本論文針對監控應用的需求，提出了一個針對壓縮域格式影片中特定行為辨識的系統；主要利用三組GOP的motion vector，加上不同的灰階值表示法。轉換至一種新的行為特徵MHPH。並利用ADABOOST對大量的MHPH進行訓練。以產生出強大的分類器，可以直接針對一連串的影片即時而快速的進行行為偵測。

(3)

目錄

摘要 ………2

目錄 ………3

第一章簡介 ………5

第二章人類動作特徵擷取………7

第三章利用 AdaBoost 進行訓練與辨識 ………8

第四章實驗結果………9

第五章結論 ………10

(4)

第一章簡介

視覺監控系統被應用在各種場合上，例如:戶外環境安全及居家看護系統上面。而這些應用都是建立在人類行為分析的基礎上;同時，現今的視覺監控系統環境多以網路架構為主。但是，大量的未經壓縮原始資料受限於網路頻寬，往往會導致無法順暢的傳送，使得系統無法順利的對資料進行分析。因此，當傳送的資料為壓縮域格式的影片時，就可以降低網路頻寬有限的問題。

基於上述的原因，本論文針對監控應用的需求，提出了一個針對壓縮域格式影片中特定行為辨識的系統；主要的部份是(1)先將壓縮域的影片進行解碼，擷取出每組gop內p frame的motion vector。(2)依續將三組gop的motion vector 串起來。並轉換成Motion history polar histogram(MHPH)。(3)利用ADABOOST 對MHPH進行辨視;而我們的方法優點在於(1)利用三組GOP所串起來的Motion vector去辨視特定行為，而不是利用單張影像去追縱辨識。可以針對一段video 去辨視是否為特定行為。(2)因為是取motion vector當feature，所以明亮度的變化並不會對我們的辨識造成影響。(3)利用adaboost去做training及辨識。可以加快辨識的效能及正確率。

(5)

Motion F eature E xtraction

C onvers ion from motion vector to

MH P H

Recognition of human

actions MH P H P os itive

and Negative training s amples

Adaboos t training weak clas s ifier to s trong clas s ifier

NO

Y E S C ompres s ed

V ideo

Conversion from motion vector to

MHPH Motion F eature

E xtraction C ompres s ed

V ideos

圖 1.1 系統架構圖

(6)

第二章

人類動作特徵擷取

我們利用連續累加三組GOP的移動向量；針對三組GOP的OAMV，去轉換成 24 個 bin 的 2D polar histogram 。同時，稱此結果為 Motion history polar histogram(MHPH)；同時，將每個 bin中區分出三種不同的灰階值，以用來去區分MHPH裡，各組GOP的資訊。並且針對每組GOP中，落於各個bin的移動向量數量多寡，分別給予四種不同的灰階值。當落於 bin內的移動向量數量愈少，表示此區域的在motion愈小。因此，就給予較高的灰階值(灰階值等於255時，是代表白色)。反之，當落於 bin內的移動向量數量愈多，表示此區域的在移動向量愈多。

因此，就給予較高低的灰階值；以蹲下的MHPH為例，經由統計的結果，第一組GOP 落於各bin裡的移動向量的數量平均在5個左右。因此，當bin裡的移動向量數量小於等於1時，視為沒有移動向量，給予灰階值為255。當bin裡的移動向量數量小於等於4且大於等於2時，視為有輕微的移動量，並給予灰階值為235。當bin 裡的移動向量數量小於等於7且大於等於5時，視為有較多的移動量，並給予灰階值為210。當bin裡的移動向量數量大於7時，視為有很大的移動量，並給予灰階值為190。

(7)

圖2.1 三組GOP轉換成MHPH

(8)

第三章

利用 AdaBoost 進行訓練與辨識

Adaboost是由Viola-Jones [3]所提出，其中利用了三個重要的方法，來完成一個執行速度即快的物件偵測架構。

(1)利用積分影像的方式，使得灰階影像中任意矩形區域的灰階值總合可以快速的以查表方式求得，並配合Haar-like特徵(矩形特徵)完成一套快速比對特徵的方法。

(2)由於單一的矩形特徵對複雜目標物的辨識率不高，因此採用AdaBoost[4]

作為訓練方法，挑選並結合多個特徵與適當的權重值，訓練出辨識率更高的分類器。

(3)提出串列式的決策架構，以每個Classifier 為最小單元代表一個簡單的矩形特徵。再以AdaBoost 訓練出一個Stage，每個Stage 包含多個Classifier，

最後將多個Stage串聯成一個辨識率最高的偵測器。藉由該架構可快的排除影像中與目標物差異較大的區域，減少不必要的計算。

來對各種行為的 MHPH 進行特徵選取，同時訓練出辨識率更高的分類器。可以快速而正確的進行人類行為的辨識。

(9)

第四章實驗結果

將本篇論文所提出的方法與[19]由 Alireza Fathi, Greg Mori 所提出的方法，做一比較。Alireza Fathi, Greg Mori 主要是利用非壓縮域的影片中的 low-level optical flow information 去建構出 mid-level motion features。

並利用 multi-adaboost 對所建構出的 mid-level motion feature 進行訓練與辨識；以辨識行為：往左或往右走路的正確率而言，我們方法優於 Alireza Fathi et al.[19]。但是以辨識行為：往左跑或往右跑的正確率而言，我們的方法相欣於 Alireza Fathi et al.;同時，我們的方法是可以辨識壓縮域的影片中的行為，

但是 Alireza Fathi et al.只能用於辨識非壓縮域的影片中的行為；另外在雜訊的影響方面，我們的方法對於輕微的雜訊可以藉由移動量的多寡來進行去除的動作。但是 Alireza Fathi et al.對於雜訊的去除方面無法做處理。

(10)

第五章結論

由上一章節的實驗結果發現，(1)從壓縮域影片中，所萃取出來的 MHPH 特徵特，

利用 Adaboost 進行人類行為的訓練與辨識，所出來的辨識結果優於[19]Alireza Fathi, Greg Mori 所提出的方法。(2)我們的方法強調於在壓縮域影片中進行人類行為的訓練與辨識。而[19] Alireza Fathi, Greg Mori 所提出的方法是針對未壓縮的影片進行辨識；(3)我們的方法只要有足夠的訓練資料，就可以針對不同行為的 MHPH 進行訓練與辨識。優於[16][17][18]所提出的做法，只能單純的辨識出行人卻無法辨識人類行為；(4)在雜訊的處理方面，由於 MHPH 會針對 motion 量給予不同的灰階值。因此，對於較小的雜訊在給予灰階值時並一併去除。

然而，我們的做法在未來尚有需要改進的部份。(1)針對於正確率的部份，目前只能針對單一種行為在不同的分類器中，進行是與否的行為辨識。但是 [19]所利用的是 Multi-Adaboost，在同一時間內對不同行為做辨識。相同的，

在未來亦可利用 multi-Adaboost 對不同行為在同一時間內進行訓練。以達到在

(11)

對於正確率也會有明顯的改進；(3)由於我們的 training data 只針對單獨個人的行為，若影片中有兩個以上的行人或行為，我們就無法做辨識。因此，這也是我們要克服的主要問題。

(12)

英文附錄

(13)

Human Action Analysis Using Motion History Polar Histogram

Prepared by Yu-Ting Fu Directed by

Dr. Cheng-Chang Lien

Computer Science and Information Engineering Chung Hua University

Hsin-Chu, Taiwan, R.O.C.

August, 2008

(14)

Abstract

The human behaviors are frequently analyzed by using the uncompressed video data (raw data). However, the amount of uncompressed video data is too huge to be transmitted via the network with limited bandwidth. Hence, the rare behavior detection within the compressed video may not only extract the visual features directly from MPEG compressed video but also overcome the limited network bandwidth problem. In this paper, several visual features extracted from the compressed video, e.g., the motion vectors and color feature are applied to develop new action feature descriptor. Based on the action feature descriptor the human actions are detected and then the rare behaviors may be identified. The proposed human action analysis system consists of following novel technologies: (1) Partial decoding of the compressed video stream and extracting of the motion vectors from the p-frame in each GOP, (2) Generation of the object-based accumulative motion vector (OAMV) from continuous three GOPs, (3) Construction of Motion History Polar Histogram (MHPH), and (4) Utilizing Adaboost algorithm to recognize various kinds of human actions. Experimental results show that the recognition rate can approach 90%.

Keywords: Human action analysis, compressed domain, object-based accumulative motion vector, Motion History Polar Histogram.

(15)

Contents

Abstract ... 14

Chapter 1 Introduction ... 16

Chapter 2 Motion Feature Extraction ... 19

2.1 Extraction of Motion Feature from MPEG Video ... 19

2.2 The Accumulative Motion Vector ... 21

2.3 The Object-based Accumulative Motion Vector (OAMV) ... 22

2.4 Motion History Polar Histogram ... 26

Chapter 3 Human Action Recognition using Adaboost ... 30

3.2 AdaBoost Training Procedures ... 32

4.1 Construction of MHPH ... 36

4.2 Training Process for Adaboost Classifier ... 37

Chapter 5 Conclusion ... 46

Reference ... 48

(16)

Chapter 1 Introduction

Recently, the surveillance systems are widely applied to the public area security [6-9] and homecare applications [10]. Both applications are established on the basis of human behavior analysis. In this study, a novel human action analysis is developed to meeting this requirement. The human behaviors are frequently analyzed by using the uncompressed video data (raw data) [11-12]. In [11], the human area mesh feature [13]

extracted from the successive frames is used to recognize the different tennis strokes.

In [12], based on the changed regions, the human actions may be analyzed by the Motion Energy Image and Motion History Images.

However, all the motion features developed in the abovementioned researches are hard to generate in the compressed video. Furthermore, the amount of uncompressed video data is too huge to be transmitted via the network with limited bandwidth. Hence, the rare behavior detection within the compressed video may not only extract the visual features directly from MPEG compressed video but also overcome the limited network bandwidth problem.

In [14], the multi-resolution images are applied to detect the human region and

(17)

However, a lot of templates are necessary to define for the template matching. In [15], the polar histogram of the motion vectors is used to analyze the various kinds of human actions. In [16], Adaboost algorithm is used to recognize both motion type and appearance information for identifying a walking person. In [20], the proposed method of Alireza Fathi and Greg Mori is the utilization of low-level optical flow information of uncompressed video data to for the structure of mid-level motion feature, which is then uses for training and detection with multi-adaboost.

Based on the observations of human actions, a complete action usually takes a few seconds, i.e., several GOPs within the compressed video are needed to analyze the human actions. In this study, several visual features extracted from the compressed video, e.g., the motion vectors and color features, are applied to develop new action feature descriptor and then the human behaviors is analyzed to detect the rare behaviors. The proposed human action analysis system consists of following novel technologies. First, we partially decode of the compressed video stream and the extract the motion vectors and color feature from the p-frame in each GOP. Second, a new motion feature called object-based accumulative motion vector (OAMV) is proposed to form a robust and prominent motion feature from continuous three GOPs.

Third, based on the OAMV we construct the Motion History Polar Histogram (MHPH) in which various kinds of human actions are described. Finally, we utilize Adaboost

(18)

algorithm to recognize various kinds of human actions. The system flowchart is illustrated in Fig. 1.1.

Motion F eature E xtraction

C onvers ion from motion vector to

MH P H

Recognition of human

actions MH P H P os itive

and Negative training s amples

Adaboos t training weak clas s ifier to s trong clas s ifier

NO

Y E S C ompres s ed

V ideo

Conversion from motion vector to

MHPH Motion F eature

E xtraction C ompres s ed

V ideos

Fig. 1.1 The system flowchart.

(19)

Chapter 2 Motion Feature Extraction

By the careful observation, the movements of different body parts are important cues to analyze human actions. In this study, we generate the Object-based Accumulative Motion Vector (OAMV) to segment the region of human body and construct the prominent accumulative motion feature by tracking each motion flow within several GOPs for each motion macroblock in p-frame. But a complete human action may takes 2 ~3 seconds, the time interval for a single GOP can not offer a sufficient time window to observe. Therefore we applying the concept of sliding window composed of several GOPs to extract the OAMV and recognize the human action that will be mentioned later.

2.1 Extraction of Motion Feature from MPEG Video

In this study, the human action is analyzed within the compressed video stream and then all the features are extracted by using the partially decoded video data. The MPEG decoding scheme from the higher to lower layers is illustrated in the Fig. 2.1.

(20)

Fig. 2.1 The layer structure for the MPEG coding system.

For each macroblock, x represents the horizontal motion vector and y represents the vertical motion vector. The vector x+y denotes the motion vector of this macroblock shown in Fig. 2.2. For example, the motion vectors for a man who is walking and falling down are shown in Fig. 2.3. It is obviously that the motion features for the different kinds of human actions may be discriminated by applying the precise descriptions about the magnitudes and directions of the extracted motion vectors.

(21)

Fig. 2.3 The motion vector of a macroblock .

(a) (b)

Fig. 2.4 The motion vectors for actions (a) walking and (b) fall down

2.2 The Accumulative Motion Vector

The action is frequently analyzed with the frame-based method in which the probability distribution of the quantized motion vectors is acquired to analyze the human action [1]. The time period used to calculate the probability distribution is about a few seconds. In order to develop a robust motion descriptor, Babu et al. [2]

proposed a new motion feature called accumulative motion feature to accumulate the motion vectors among the successive frames. The motion features are accumulated backward and forward according to the following formula:

( ) ^{( )} _∑

⁻

( ^{( )} ^{( )} )

−

=

+

= ⁿ ^c

n f

ki y ki

x f m f

m l

k l k

1

, ˆ ,

ˆ,

,

MB

．

_x

y

y x+

(22)

( ) ^{( )} _∑

⁻

( ^{( )} ^{( )} )

−

=

−

= ⁿ ^c

n f

ki y ki

x f m f

m l

k l k

1

, ˆ ,

ˆ,

(2.1)

where n is the index of current frame, (k, l) denotes the center position of a macroblock, and m_x

( )

• and m_y

( )

• denote the motion vectors in x and y directions respectively. Then the representative motion for each macroblock is obtained by choosing the median value of all the motion vectors falling around the corresponding macroblock. However the accumulation process is performed at the same position such that the prominent motion trajectory for each macroblock in the human body is difficult to acquire.

2.3 The Object-based Accumulative Motion Vector (OAMV)

Generally, the time period for a complete human action will be a few seconds.

Hence, several GOPs are required to extract the visual features for recognizing the human action. To obtain the prominent motion feature for human action a new GOP-based motion descriptor called the object-based accumulative motion vector (OAMV) is generated. The concept of the OAMV is described in Fig. 3.5. In Fig. 3.5, the blue line segments represent the motion vectors of crouch down movement. It is

(23)

motion feature that can characterize each human action well.

(a) (b)

(c) (d)

Fig. 2.5 Squat down movements (a) Frame number 16. (b) Frame number 19. (c) Frame number 22. (d) Frame number 25.

By continuously tracking each block within each GOP, the OAMV motion feature is generated. The OAMV motion descriptor is generated with the following steps:

(1) Initialize a starting frame and record the center position for each block.

(2) For the P-frame, by using the extracted motion vectors the new center position for each block is updated with the center position of the corresponding block shown in Fig. 3.6. Record the updated center position for each block within the object.

(3) Repeat the process in step 2 until all the P frames are searched.

(24)

the human body shown in Fig. 2.7 .

Fig. 2.6 The new center position for the block (blue) is updated with the center position of the corresponding block (red).

(a) (b) (c) (d)

Fig. 2.7 Construction of OAMV for the block on face region. The block tracking process is illustrated in p-frames (a), (b), and (c). (d) The OAMV motion feature.

Based on the careful observation of human actions, we apply the concept of sliding widow with time interval of three GOPs to construct the OAMV feature and

Frame n+1

Frame n

time

(25)

the corresponding human action.

(a) (b)

(c) (d)

Fig. 2.8 Fall down movements (a) GOP3. (b) GOP4. (c) GOP5. (d) Accumulation three GOP result.

(26)

(a) (b)

(c) (d)

(e) (f)

Fig. 2.9 Result of accumulating motion vectors from continuous three GOP

2.4 Motion History Polar Histogram

Here, the polar histogram [10] for the OAMV motion feature is applied to describe the moving characteristic of human action. The objective of the polar histogram is to describe the human action with its own unique pattern. For example,

(27)

the OAMV motion features from Cartesian coordinate to the polar coordinate. The

transformation formulas are written as:

2 2

x

x m

m

m_θ = +

, (2.3)

⎟⎠

⎜ ⎞

⎝

= ⎛

∠ ⁻

x y

m m_ϑ tan ¹ m

.

(2.4)

Based on the concept of OAMV and motion polar histogram, we propose the motion history polar histogram (MHPH) to describe the polar distributions of the OAMV feature across three GOPs. Fig. 3.10 illustrates two MHPHs for the walking and falling down actions.

(a) (b)

(c) (d)

Fig. 2.10 The MHPH in 2-D polar feature expressions. (a) The walking action. (b) The falling down action.

(28)

The From the MHPH in Figure 3.10 we can see that there are 3 colors in every bin. This is because we will distinguish every data of each set of GOP in the MHPH by quantifying the motion vector that falls into each bin in each set of GOP with 4 different values of grey. When the quantity of motion vectors that falls into each bin is low, it represents that the motion in this particular area is small, thus the higher value of grey is given. (When grey value is 255, it is represented by the color White). On the other hand, when the quantity of motion vector in the bin is high, it represents that the motion is large. Therefore it will be given a lower value of grey; with the MHPH for squatting as an example, after compiling the data, the quantity of motion vectors that falls into the bin of the first set of GOP is average at 5. Thus when the quantity of motion vector in the bin is less than 1, it is considered as no motion and will be give the grey value of 255. When the quantity is greater than 2 but less than 4, it will be considered as slight movement and will be give the grey value of 235. When the quantity is greater than 5 but less than 7, it is considered as significant movement and will be given the grey value of 210. When the quantity is greater than 7, it is considered a large movement and will be given the grey value of 190; the MHPH in

(29)

Table 2.1 MHPH in each GOP corresponding grey value GOP Number

Of MHPH

No Motion quantity

Slight motion quantity

More motion quantity

Huge motion quantity

GOP1 255 230 150 75

GOP2 255 200 125 50

GOP3 255 180 100 25

Based on the conventional polar histogram [10], we improve it by adding the time sequence relation of OAMV from sequence of GOP into each bin of the conventional polar histogram. The proposed novel polar histogram is called motion history polar histogram (MHPH) that is constructed by the following steps.

Step1. The OAMV for successive three GOPs are quantized and encoded into the corresponding bin of the conventional polar histogram (24 bins).

Step 2. According to the different quantity of motion vector, the bin for each GOP is mapped with the gray scale listed in Table 2.1.

Fig. 2.11 3 sets of motion values in 3 sets of GOP shown in 1 set of bin

(30)

Chapter 3 Human Action Recognition using Adaboost

In this study, we apply the method of Adaboost [3] to classify the MHPH pattern of each human action and then recognize various kinds of human action. Here, some important concepts of Adaboost and the method of recognizing the human actions are described.

3.1 Integral Image

Viola-Jones [3] introduced a new image representation called integral image which makes the computation of the rectangular features efficient. The integral image at location (x, y) is the sum of the pixel value above and to the left of (x, y) inclusive shown in Fig. 3.1.

Fig. 3.1 Definition of Integral image.

。 , ) (x y

(31)

Fig. 3.2 Calculation of Integral Image at (x,y).

Fig.3.3 Fast calculation of Haar feature using integral images.

By the definition of integral image, we can obtain various size and orientation of rectangular feature in an image as shown in Fig.3.4 Each image has only one rectangular feature and this rectangular feature acts as filter and shifts its position over the image. As mentioned above, the size can be varied, but the proportion of the white rectangular region and black rectangular region must be the same.

Fig. 3.4 Rectangular feature in an image.

A B

C

．

D

．

³

．

1 2

4

) 3 ( ) 2 ( ) 1 ( ) 4 ( ) 1 (

) 2 (

) 3 (

) 4 (

s s s s D

A s

B A s

C A s

D C B A s

−

− +

=

= +

=

+ + +

= (0,0)

(x,y)

) , ( ) , 1 ( ) , (

) , ( ) 1 , ( ) , (

y x v y x s y s s

y x i y

x v y x v

+

−

=

+

−

=

(32)

3.2 AdaBoost Training Procedures

A given set of features and a set of positive and negative images, the AdaBoost algorithm will choose the best weak classifier in each round. A weak classifier is better than randomly guess which has only 50% accuracy. The best weak classifier is the classifiers which achieve the smallest training error. After choosing the best weak classifier, the AdaBoost will adjust the weight of every training image. The weight of misclassified training images will increase in this round, and the weight of classified training images will remain. In the next round, the AdaBoost will focus on the misclassified images and try to let the misclassified images can be classified correctly in the last round.

The algorithm of AdaBoost is shown in Table 3.2. First of all, we will have a set of m negative image and l positive image respectively. Then, choose N features from a set of rectangular filter which can represent pedestrian. Finally, the strong classifier is a linear combination of these N weak features.

(33)

Fig. 3.5 Cascaded classifier [3].

Table 3.2 The algorithm of AdaBoost

Given example image (x₁,y₁),K,(x_n,y_n) where y_i =0 for negative and y_i =1 for positive examples respectively.

{

w1^,w2^, wn

}

^,

W = K where

⎪⎪

⎩

⎪⎪⎨

⎧

=

examples positive

of number m

y l if

examples negative

of number m

y m if w

i i i

//

2 1 1

//

2 0 1

For t =1,K,T

(1) Normalize the weights,

∑

=

= _n

j j

i

i w

w w

1

(2) For each filter f_j∈filterSet , train a feature h_j by determining the optimal threshold for the filter

(3) Assign errorsεj ⁼

∑

ⁿiwihj(xi)⁻yi

(4) Select the classifier h with the lowest error _t ε_t

(5) Update the weights, w_i =w_iβ^eⁱ where e_i =1 if example x is correctly _i classified otherwise e_i =0 and

t t

t ε

β ε

= − 1 All Sub Windows All Sub Windows All Sub Windows

Classifier 1 True Classifier 2

Classifier 3

True True

False

False False

Reject Sub Windows

Further Processing

(34)

The final strong classifier is

⎪⎩

⎪⎨

⎧ ≥

=

∑

=

∑

=

otherwise x h x if

h

T

t t

T

t t t

0

2 ) 1 ( ) 1

( ¹α ¹α

Where 1 )

log(

t

t β

α =

(35)

Chapter 4 Experimental Results

And in Chapter 2 and 3, we have specifically mentioned the definition of MHPH and the formulation of Adaboost training. In this chapter, we use Adaboost to identify of four human actions (see Fig. 4.2) including walking, squat down, running, and falling down. Furthermore, we will compare the proposed method with the algorithm proposed by Fathi [19] which apply Adaboost to recognize human actions. Figure 4.1 illustrate the system block of the proposed system.

Action VIDEO

MPEG2 Decoder

Motion Vector of P‐Frame

Unit of three GOP Convert to MHPH

Create Positive and Negative Samples

MHPH of Walk Database

MHPH of Squat Down Database MHPH of Fall

Down Database

MHPH of Other motion Database AdaBoost Training

and Get Casade

Recognize MHPH of Video

Use Adaboost OpenCV

Fig. 4.1 System block of the proposed system

(36)

(a)

(b)

(c)

(d)

Fig. 4.2 Images of different behavior such as (a) walking, (b) squat down, (3) falling down, (d) Others: Jumping

4.1 Construction of MHPH

In Adaboost algorithm, huge amount of positive and negative sample are required to train a strong classifier. From each set of decoded videos, we extract the

(37)

describes the quantization specifications of MHPH for each kind of human action.

Table 4.2 describes the specification of grey value mapping for each kind of human action.

Table 4.1 Action amount of MB in polar histogram

Amount of Motion Walk Squat Down Fall Down Run

No Motion <=5 <=1 <=2 <=5

More Slight Motion >5 || <=10 >1 || <=4 >2 || <=4 >5 || <=5 Slight Motion >10 || <=20 >4 || <=8 >4 || <=7 >10 || <=10 Huge Motion >20 || <=30 >8 || <=12 >7 || <=10 >15 || <=15 More Huge Motion >30 >12 >10 >20

Table 4.2 Gary value of distribute of GOP corresponding quantity of MB

Amount of Motion GOP1 GOP2 GOP3

No Motion 255 255 255

More Slight Motion 231 210 190

Slight Motion 168 148 126

Huge Motion 105 85 64

More Huge Motion 42 21 0

4.2 Training Process for Adaboost Classifier

By using 150 sets of MHPH from 15 video clips of walking behavior video format, 100 sets of MHPH from 14 segments of squatting down behavior, 130 sets of MHPH from 10 segments of falling down behavior, 70 sets of MHPH from 5 segments of running behavior, and 120 sets of MHPH from 11 segments of other behaviors; the following shows the stages of training and identification;

(38)

Step 1: Classifying 100 sets of MHPH for walking, falling down, running and squatting down as positive images, and 50 sets of MHPH for other behaviors as negative images.

Step 2: Before Adaboost is trained, every positive images must be checked to confirm that the main features in positive images is correctly highlighted, and is given the right values. Using create sample in OpenCV [ 20] to create the positive sample needed for highlighting the featured area.

Step 3: Using the highlighted positive sample in Step 2, OpenCV[20] is used to create positive sample of every behaviors. This consists of: 100 sets of positive sample for the behavior of walking, 100 sets of positive sample for the behavior of falling down, 50 sets of positive sample for the behavior of running, and 100 sets of positive sample for the behavior of squatting down.

(39)

Fig. 4.8 Positive sample create as according to the MHPH of squatting down.

Step 4: Using OpenCV[20] and positive images is trained to become effective classifier; during training, hit rate of various behaviors are designated at the lowest value of 0.995, and false alarm rate is designated at the highest value of 0.5. Firstly, openCV[20] will isolate on the haar-feature resulted by the quantity of the sample during training. Adaboost will focus on the training of the background area. This is because the concept of Adaboost is to first utilizing the feature to eliminate the large portion of background information, then it will train areas which are more difficult to distinguish. For example, with the behavior of walking, the first stage took 15 seconds and 2 features to eliminate the background information. The second stage took 30 seconds and 3 features to eliminate the background information. The third stage took 4 minutes and 7 features to eliminate the background information. Upon the 9 stage, it took 17 hours and 21 features to eliminate the background information; example with the behavior of squatting down, the first stage took 21 seconds and 3 features to eliminate the background

(40)

information. The second stage took 41 seconds and 4 features to eliminate the background information. The third stage took 7 minutes and 7 features to eliminate the background information. Upon the 14 stage, it took 27 hours and 32 features to eliminate the background information; example with the behavior of falling down, the first stage took 13 seconds and 4 features to eliminate the background information. The second stage took 37 seconds and 5 features to eliminate the background information. The third stage took 2 minutes and 7 features to eliminate the background information. Upon the 12 stage, it took 23 hours and 27 features to eliminate the background information; another example with the behavior of running, the first stage took 21 seconds and 2 features to eliminate the background information. The second stage took 33 seconds and 3 features to eliminate the background information. The third stage took 4 minutes and 4 features to eliminate the background information. Upon the 7 stage, it took 9 hours and 19 features to eliminate the background information; all input data used as training reference with influence the results after training, and the efficiency of execution. Main influencing attributes are;

(41)

positive and negative images will require 3 hours, but to train 100 sets of positive and negative images will require 1 working day.

(2)The ratio of positive images to negative images quantity: During training, with experiences from previous studies, the quantity of positive images is twice the quantity of negative images. Classifier trained under such conditions has better results as more negative images helps to narrow the possibility of correct areas.

(3)Sample Size: during training, a broader range can be obtained with larger sample size. Relatively the time consumed is higher during training and identification. Under the same condition, sample size 10 training duration is shorter, but accuracy is lower. Sample size 20x20 has a longer training duration and is more accurate.

(4)Level of stage: The quantity of sub-cascade determined the efficiency of the classifier. But longer duration is required.

(5)The value of minimum hit rate and Maximum false alarm: Accuracy of the value will determinate the efficiency of the classifier. When the minimum value is high, or when the Maximum false alarm value is small, the higher the standard to achieve is set higher, the classifier will produced better results.

(6)Basic or all mode choices: Upright feature is the main Extended Sets of

(42)

Haar-like Feature in Basic mode. All modes include 45 degree rotated feature set other than upright feature; Training is faster with Basic mode but with poor detection rate. Training is longer with all mode but the classifier it produced have better detection rate.

Therefore we proceed with the training stage by using different attributes to initiate the training. Using different classifier produced for detection, the results are observed and the variation in the detection rate.

Step 4: Using 50 sets of MHPH derived from 5 segments of video, classifiers produced for detection in Stage 3 are used for detection of different behaviors.

Table 4.3 Results of walking detection Classifiers Correct

detection

Incorrect detection

Rate of accuracy

Duration Image Number：30，

Sample Size：10x10，

Stage Number：3

31 19 62% Train Duration:30 mins Detect Duration: 1sec Image Number：50，

Stage Number：5

37 13 74% Train Duration:4 Hour Detect Duration: 3 sec Image Number：100，

Stage Number：9

47 3 94% Train Duration:1 Day Detect Duration: 5 sec

(43)

Stage Number：3 Image Number：50，

Stage Number：5

Stage Number：12

Table 4.4 Results of falling down detection Classifiers Correct

detection

Incorrect detection

Rate of accuracy

Stage Number：3

29 21 58% Train Duration:30 min Detect Duration: 1 sec Image Number：50，

Stage Number：5

Stage Number：14

Table 4.4 Results of running detection Classifiers Correct

detection

Incorrect detection

Rate of accuracy

Stage Number：3

28 22 56% Train Duration:30 min Detect Duration: 1 sec Image Number：30，

Stage Number：5

Stage Number：7

With observation from the walking detection in Table 4.3, when training data is

(44)

although training duration is short, rate of accuracy in detection is also lower. Rate of accuracy in detection for squatting down and falling down is lower when compared with the results of walking detection. This is due to the variation in motion direction provided as training data, such as in the hands, legs, and body in falling down and squatting down as to walking. Thus, error in the detection is caused. However if adequately large volume of training data is used, this error can be resolved. We also observed in our results that Adaboost, even though it is limited by the accuracy of the classifiers, and is commonly used for facial recognition, the duration it took to do not have a big difference. It took an average 1~3 seconds to detect one image. This result shows that Adaboost can be used for behavioral detection. This is because the greatest problem in behavioral monitoring is the ability to accurately detect a specific behavior in real time. But with our formulation, accurate real time behavioral detection is possible.

Next, we compared the proposed method of Alireza Fathi and Greg Mori with our method. The proposed method of Alireza Fathi and Greg Mori is the utilization of low-level optical flow information of uncompressed video data to for the structure of

(45)

same as the proposed method of Alireza Fathi et al.[19]. But in the case of behavior detection: running towards the right and running towards the left, our method is superior than the proposed method of Alireza Fathi et al; meanwhile, our method is able to detect using compressed video data, while the proposed method of Alireza Fathi et al can only detect behavior using uncompressed video data.

Table 4.5 Accuracy rate of our method compare with Fathi [19]

Action Our Method Alireza Fathi et al

Running 72% 72%

Walking 94% 92%

Table 4.5 Summary of our method with Fathi[20]

Compare Item Our Method Alireza Fathi et al Method Adaboost Multi-Adaboost Video Type Compress Video Un-compress Video

Training Feature MPHP Mid-Level motion feature Noise reduce Reduce slight noise No handle noise part

The problem faced by using a simple tool such as Adaboost to detect individual and singular behavior such as walking and not walking, falling down and not falling down, is that people may walked a certain distance before demonstrating such behaviors. We can only detect specific behavior purely in a designated duration. Other than that, more detailed detection is difficult. Therefore multi-adaboost can considered in later studies for training and detection.

(46)

Chapter 5 Conclusion

Based on the testing results from the last chapter, we can see that, (1) the characteristic of MHPH extracted from compressed video data can be trained and is able to distinguish human behavior with adaboost, and the result is more accurate than the proposed method of Alireza Fathi [19] and Greg Mori. (2) We emphasized that our method uses compressed video data to train and detect human behavior, while the proposed method of [19] Alireza Fathi and Greg Mori can only use uncompressed video data; (3) With adequate training data, our method can be used to train and detect MHPH of various human behaviors, which is more effective than human detection methods proposed in [16][17][18]; (4) in the case of information processing, as MHPH will assign different grey values to different motion, thus smaller values of motion will be eliminated.

However, our method has more room for improvement in the future. (1) in terms of accuracy, we can only distinguish the true false value of a single behavior using various filters. But the concept of using Multi-adaboost by [19] can detect

(47)

circular MHPH. Therefore if haar-like feature can be modified in reference to MHPH into non-circular or 45 degree angled, accuracy in detection will greatly improved; (3) As our training data focused on the behavior of a single human being, video data with 2 or more people will be impossible to detect. Thus this is one of the main problems we have to overcome.

(48)

Reference

[1] R. V. Babu, B. Anantharaman, K. R. Ramakrishnan and S. H. Sinivasan,

“Compressed domain action classification using HMM,” Pattern Recognition Letters, Vol. 23, Issue: 10, pp. 1203-1213, Aug. 2002.

[2] R. V. Babu and K. R. Ramakrishnan, “Content based video retrieval using motion descriptors extracted from compressed domain,” Circuits and Systems, 2002.

ISCAS2002. IEEE International Symposium, Vol. 4, pp. IV-141 - IV-144, 26-29

May 2002.

[3] P. Viola and M. J. Jones, “Robust real-time face detection, ＂ International Journal of Computer Vision, Vol. 57, No. 2, pp. 137-154, 2004.

[4] Freund, Y., and Schapire, R., “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, ＂Journal of Computer and System Sciences, 55(1), pp. 119–139, Aug. 1997.

[5] Alexander Kuranov, Rainer Lienhart, and Vadim Pisarevsky, “An Empirical Analysis of Boosting Algorithms for Rapid Objects with an Extended Set of Haar-like Features.” Intel Technical Report MRL-TR-July02-01, 2002.

(49)

on Pattern Recognition, Vol. 4, pp. 913-916, Aug. 23-26, 2004.

[7] J. A. Freer, B. J. Beggs, H. L. Fernandez-Canque, F. Chevrier, and A. Goryashko,

“Automatic video surveillance with intelligent scene monitoring and intruder detection,” Proc. 30th Annual 1996 International Carnahan Conference, pp.

89-94, October 1996.

[8] A.Divakaran, K. Vetro, K. Asal and H. Nishikawa, “Video Browsing System Based on Compressed Domain Feature Extraction,” IEEE Transactions on Consumer Electronics, Vol. 46, No. 3, pp. 637-644, Aug. 2000.

[9] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D.

Tolliver, N. Enomoto, O. Hasegawa, P. Burt, and L. Wixson, “A system for video surveillance and monitoring: VSAM Final Report,” Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, May 2000.

[10] D. Liu, P. C. Chung, and Y. N. Chung, “Human Home Behavior Interpretation from Video Streams,” Networking, Sensing and Control, 2004 IEEE International Conference, Vol. 1, pp. 192-197, Mar. 21-23, 2004.

[11] J. Yamato, J. Ohya and K. Ishii, “Recognizing human action in time-sequential images using hidden Markov model,” Computer Vision and Pattern Recognition 1992. Proceedings CVPR '92, 1992 IEEE Computer Society Conference on, pp.

379-385, June 15-18, 1992.

(50)

[12] R. Rosales,“Recognition of Human Action Using Moment-Based Features,”

Technical Report BU 98-020, Boston University, Computer Science, 1998.

[13] M. Umeda, “Recognition of Multi-Font Printed Chinese Characters,” In Proc.

6th ICPR, pp. 793-796, 1982.

[14] R. V. Babu, B. Anantharaman, K. R. Ramakrishnan and S. H. Srinivasan,

“Compressed domain action classification using HMM,” Pattern Recognition Letters, Vol. 23, Issue: 10, pp. 1203-1213, Aug. 2002.

[15] J. Morris, M. J. Lee, and A. G Constantinides, “Graph theory for image analysis:

an approach base on the shortest spanning tree,” Proc. Inst. Elect. Eng., Vol. 133, pp. 146-152, Apr. 1986.

[16] Viola P. Jones, M.J., Snow D., “Detecting Pedestrians Using Patterns of Motion and Appearance,” IEEE International Conference on Computer Vision (ICCV), Vol. 2, pp. 734-741, October 2003.

[17] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proc. IEEE International Conference on Computer Vision and pattern Recognition 2005.

(51)

Features,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[20] http://sourceforge.net/projects/opencvlibrary/, Face Detection library.

中 華 大 學

中 華 大 學 碩 士 論 文

題目: 於壓縮域影片進行人類行為分析 Human Action Analysis Using Motion History

Polar Histogram

系 所 別：資訊工程學系碩士班 學號姓名：M09302010 傅郁婷 指導教授：連 振 昌 博士

中華民國 九十七 年 八 月

摘要

目錄

第一章 簡介

第二章

人類動作特徵擷取

第三章

利用 AdaBoost 進行訓練與辨識

第四章 實驗結果

第五章 結論

英文附錄

Human Action Analysis Using Motion History Polar Histogram

Prepared by Yu-Ting Fu Directed by

Dr. Cheng-Chang Lien

Computer Science and Information Engineering Chung Hua University

Hsin-Chu, Taiwan, R.O.C.

August, 2008

Abstract

Chapter 1 Introduction

Chapter 2 Motion Feature Extraction

( ) ( ) ∑

( ( ) ( ) )

．

( ) ( ) ∑

( ( ) ( ) )

( )

( )

Chapter 3 Human Action Recognition using Adaboost

．

．

．

．

{

}

∑

∑

∑

∑

Chapter 4 Experimental Results

Chapter 5 Conclusion

Reference

中華大學

中華大學碩士論文

系所別：資訊工程學系碩士班學號姓名：M09302010 傅郁婷指導教授：連振昌博士

中華民國九十七年八月

第一章簡介

第四章實驗結果

第五章結論

( ) ^{( )} _∑

( ^{( )} ^{( )} )

( ) ^{( )} _∑

( ^{( )} ^{( )} )