Chapter 3 Object extraction and Human detection

3.2 Human detection

3.2.3 SVM classifier

Support Vector Machines (SVMs) were developed to solve classification and regression problems. The SVM formulation starts from the linearly separable case. For classification, the goal of an SVM is to separate two classes by a function induced from the available examples. Consider the example in Fig. 3-6: there are two classes of data and many possible linear classifiers that can separate them, but only one of them maximizes the margin, that is, the distance between the classifier and the nearest data points of each class. This linear classifier is called the optimal separating hyperplane. Fig. 3-7 shows the accuracy rate and the number of support vectors (SVs) for each number of ICs up to 76. Based on this experimental result, the accuracy rate of the human detection system increases as the number of ICs increases. In practice, however, we select 30 ICs, because the number of SVs is then close to its minimum and the accuracy rate is above 90%. The main reason for choosing the minimum number of SVs is to reduce the computational cost in order to meet the real-time requirement.

Fig. 3-6 Optimal separating hyperplane

(a)

(b)

Fig. 3-7 Analysis of feature selection. (a) Accuracy rate. (b) Number of SV.

Chapter 4

Fig. 4-1 Human tracking flow chart

A human tracking system based on the mean-shift algorithm is proposed in this thesis. The mean-shift algorithm is a simple iterative procedure that shifts each data point to the average of the data points in its neighborhood, thereby locating the maxima of a density function given discrete data sampled from that function [36]. Generally, the mean-shift algorithm uses a color feature, but in this thesis we combine the color feature with the ICs (independent components) feature. The idea behind the combination comes from the ICs-based human detection in Chapter 3. The color feature alone is able to track a moving object, but the combination of color and ICs makes it possible to track specifically a moving human rather than any moving object.

The Kalman filter is applied to predict the location of the moving human in the next frame. The Kalman filter is activated when the moving human is partially occluded by another object, in other words when the mean-shift similarity value decreases until it is smaller than a threshold value.

4.1 Features

4.1.1 Color feature

Most kernel-based object tracking methods use color as the feature. Color extends the original grey-scale image into three dimensions, which improves tracking performance. Most papers that use mean-shift as the tracking algorithm rely on color as the feature to lock onto the object. We compare the performance of three color spaces: RGB, HSV and Y'UV. In our experiments the HSV color space gave good tracking performance, so we apply it to tracking with the active camera.

HSV color transformation:

The main purpose of the HSV color space transformation is to reduce the sensitivity to illumination or lightness information in the RGB color space. Fig. 4-2 (a) and (b) show the RGB and HSV color spaces, respectively.

The HSV model, also known as HSB model, was created in 1978 by Alvy Ray Smith. It is a nonlinear transformation of the RGB color space. It defines a color space in terms of three components: hue, saturation, and value. The definition is described below: [16]

1. Hue: the color type, ranging from 0 to 360 degrees. Each value corresponds to one color. For example, 0 is red, about 30 is orange and 60 is yellow. 360 degrees is equivalent to 0 degrees.

2. Saturation: the intensity of the color, ranging from 0% to 100%. 0 means no color, that is, only gray values between black and white. 100 means the most intense (purest) color.

3. Value: the brightness of the color, also ranging from 0% to 100%. 0 is always black. Depending on the saturation, 100 may be white or a more or less saturated color.

(a) (b)

Fig. 4-2 (a) RGB color model [17] (b) HSV color model [18]

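The transformation from the RGB color model to the HSV color model can be described by the following equations:

\[
H = \begin{cases}
\text{undefined}, & \text{if } MAX = MIN \\
\left(60^\circ \times \dfrac{G - B}{MAX - MIN}\right) \bmod 360^\circ, & \text{if } MAX = R \\
60^\circ \times \dfrac{B - R}{MAX - MIN} + 120^\circ, & \text{if } MAX = G \\
60^\circ \times \dfrac{R - G}{MAX - MIN} + 240^\circ, & \text{if } MAX = B
\end{cases}
\]

\[
S = \begin{cases}
0, & \text{if } MAX = 0 \\
1 - \dfrac{MIN}{MAX}, & \text{otherwise}
\end{cases}
\]

\[
V = MAX
\]

where MAX and MIN are the maximum and minimum values of (R, G, B), respectively.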
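As an illustration, the MAX/MIN formulation of the RGB to HSV transformation can be sketched in Python as below. The function name `rgb_to_hsv` and the convention of returning a hue of 0 for gray pixels (where the hue is undefined) are assumptions made for this sketch, not part of the original implementation.

```python
import colorsys  # only used to cross-check the sketch against the standard library


def rgb_to_hsv(r, g, b):
    """Convert R, G, B in [0, 1] to (H in degrees, S in [0, 1], V in [0, 1])."""
    mx, mn = max(r, g, b), min(r, g, b)
    delta = mx - mn
    if delta == 0:
        h = 0.0  # hue is undefined for grays; 0 is an assumed convention
    elif mx == r:
        h = (60.0 * (g - b) / delta) % 360.0
    elif mx == g:
        h = 60.0 * (b - r) / delta + 120.0
    else:
        h = 60.0 * (r - g) / delta + 240.0
    s = 0.0 if mx == 0 else delta / mx  # equal to 1 - MIN / MAX
    v = mx
    return h, s, v


if __name__ == "__main__":
    print(rgb_to_hsv(0.2, 0.4, 0.6))
    # The standard library returns the hue as a fraction of a full turn.
    print(colorsys.rgb_to_hsv(0.2, 0.4, 0.6))
```

The value channel depends only on MAX, which is why separating it from hue and saturation reduces the influence of brightness changes on the color description.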

Y'UV color transformation

The Y'UV color model defines a color space in terms of one luma (Y') and two chrominance (UV) components. The Y'UV color model is used in the NTSC, PAL, and SECAM composite color video standards. Y' is the brightness component, and U and V are the chrominance components. The standard transform equations are as follows:

\[
\begin{aligned}
Y' &= 0.299R + 0.587G + 0.114B \\
U &= -0.147R - 0.289G + 0.436B \\
V &= 0.615R - 0.515G - 0.100B
\end{aligned}
\]

Each channel of the color space has 8 bits of data, so the full color space has 256x256x256 levels. We quantize the color space into 16x16x16 bins. Therefore, the histogram of the color feature consists of 4096 bins.
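The quantization can be sketched with NumPy as follows. This is a minimal illustration that assumes 8-bit, three-channel pixels; the function names and the ordering of the channels inside the bin index are choices made for this example only.

```python
import numpy as np


def color_bin_index(pixels, bins_per_channel=16):
    """Map 8-bit 3-channel pixels of shape (..., 3) to bin indices in [0, bins**3)."""
    pixels = np.asarray(pixels, dtype=np.uint8)
    step = 256 // bins_per_channel              # 16 intensity levels per bin
    q = pixels.astype(np.int64) // step         # per-channel bin, 0..15
    return (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]


def color_histogram(pixels, bins_per_channel=16):
    """Normalized color histogram with bins_per_channel**3 bins (4096 by default)."""
    idx = color_bin_index(pixels, bins_per_channel).ravel()
    hist = np.bincount(idx, minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    roi = rng.integers(0, 256, size=(32, 16, 3), dtype=np.uint8)  # synthetic ROI
    hist = color_histogram(roi)
    print(hist.shape, hist.sum())
```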

4.1.2 ICA feature

In Chapter 3, the human detection algorithm uses entropy-based feature selection to select 30 ICs from the full set of ICs. These 30 ICs are processed by the SVM classifier to decide whether the foreground object is human or non-human. If the SVM classifier classifies the foreground object as human, these ICs are then used in the mean-shift tracking algorithm.

In practice, the combination of color and ICs constructs a histogram with a total of 4126 bins (4096 color bins plus 30 ICs). The combination procedure is as follows:

1. Select the ROI (region of interest), in this case the human's ROI.

2. Compute the color histogram using the kernel function, as shown in Fig. 4-3 (b).

3. Extract the ICs.

4. Combine the histograms of the color and ICs features, as shown in Fig. 4-4.
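The final combination step can be sketched as a simple concatenation. The text does not state how the 30 IC values are scaled before they are appended to the color histogram, so the sketch below assumes that their magnitudes are used and that the combined vector is renormalized to sum to one; both choices, like the function name `combine_features`, are assumptions.

```python
import numpy as np


def combine_features(color_hist, ic_values):
    """Concatenate a 4096-bin color histogram and 30 IC values into 4126 bins."""
    color_hist = np.asarray(color_hist, dtype=float)
    ic_part = np.abs(np.asarray(ic_values, dtype=float))  # assumption: use magnitudes
    combined = np.concatenate([color_hist, ic_part])
    total = combined.sum()
    # assumption: renormalize so the combined vector behaves like a histogram
    return combined / total if total > 0 else combined


if __name__ == "__main__":
    color_hist = np.full(4096, 1.0 / 4096)
    ics = np.linspace(-1.0, 1.0, 30)
    feature = combine_features(color_hist, ics)
    print(feature.shape, feature.sum())
```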

Fig. 4-3 (a) Human's ROI (b) Kernel function (c) Histogram of color feature

Fig. 4-4 Color and ICA feature histogram

4.2 Kernel functions

The feature-histogram-based target representations are regularized by spatial masking with an isotropic kernel. The masking induces spatially smooth similarity functions suitable for gradient-based optimization; hence, the target localization problem can be formulated using the basin of attraction of the local maxima [19].

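There are many kinds of kernel functions, such as the Gaussian kernel, the flat kernel and the Epanechnikov kernel. Let x be the normalized pixel location in the region defined as the target model. Then, up to a normalization constant, the Gaussian, flat and Epanechnikov kernels [36] are defined as follows.

1. Gaussian kernel

\[
K(x) = \exp\left(-\tfrac{1}{2}\|x\|^2\right)
\]

2. Flat kernel

\[
K(x) = \begin{cases} 1, & \text{if } \|x\| \le 1 \\ 0, & \text{otherwise} \end{cases}
\]

3. Epanechnikov kernel

\[
K(x) = \begin{cases} 1 - \|x\|^2, & \text{if } \|x\| \le 1 \\ 0, & \text{otherwise} \end{cases}
\]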

Fig. 4-5 (a) and (c) show that the Gaussian and Epanechnikov kernels are similar: both have their highest value at the center of the distribution. In the ROI of the target model, pixels closer to the center of the ROI are the more important ones. Therefore, the Gaussian and Epanechnikov kernels give little weight to the boundary information, and their accuracy is higher than that of the flat kernel.
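To make the comparison concrete, the three kernels can be evaluated over a rectangular ROI as in the sketch below. The kernels are written as functions of the squared normalized distance and without normalization constants; the helper `kernel_mask` and its normalization (the ROI border along each axis lies at distance 1 from the center) are assumptions for illustration.

```python
import numpy as np


def gaussian_kernel(r2):
    """Gaussian kernel as a function of the squared normalized distance r2."""
    return np.exp(-0.5 * r2)


def flat_kernel(r2):
    """Flat kernel: constant inside the unit circle, zero outside."""
    return np.where(r2 <= 1.0, 1.0, 0.0)


def epanechnikov_kernel(r2):
    """Epanechnikov kernel: decreases linearly in r2 and vanishes at r2 = 1."""
    return np.where(r2 <= 1.0, 1.0 - r2, 0.0)


def kernel_mask(height, width, kernel):
    """Evaluate a kernel on every pixel of a height x width ROI."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ny = (ys - cy) / max(cy, 1.0)  # normalized vertical offset from the center
    nx = (xs - cx) / max(cx, 1.0)  # normalized horizontal offset from the center
    return kernel(nx ** 2 + ny ** 2)


if __name__ == "__main__":
    for name, k in [("gaussian", gaussian_kernel), ("flat", flat_kernel),
                    ("epanechnikov", epanechnikov_kernel)]:
        mask = kernel_mask(5, 5, k)
        print(name, "center:", mask[2, 2], "border:", mask[2, 0])
```

The flat kernel gives border pixels the same weight as the center pixel, whereas the other two kernels reduce the weight toward the border.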

(a) (b) (c)

Fig. 4-5 (a) Gaussian kernel (b) flat kernel (c) Epanechnikov kernel

(a) (b) (c)

Fig. 4-6 (a) Target object (b) Kernel function (c) Target object and Kernel function

4.3 Mean-shift algorithm

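In order to characterize the target, a feature space is chosen first. The reference target model is represented by its normalized histogram $\hat{q}$ in the feature space. The target model can be considered as centered at the spatial location 0. In the subsequent frame, the candidate model is defined at location y and is expressed as $\hat{p}(y)$. We use Eq. 4.8 and Eq. 4.9 as our target and candidate models, respectively.

\[
\hat{q} = \{\hat{q}_u\}_{u=1 \ldots m}, \qquad \sum_{u=1}^{m} \hat{q}_u = 1 \tag{4.8}
\]

\[
\hat{p}(y) = \{\hat{p}_u(y)\}_{u=1 \ldots m}, \qquad \sum_{u=1}^{m} \hat{p}_u(y) = 1 \tag{4.9}
\]

The similarity value between $\hat{p}(y)$ and $\hat{q}$ is denoted by $\hat{\rho}(y) \equiv \rho[\hat{p}(y), \hat{q}]$.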

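Let $\{x_i^*\}_{i=1 \ldots n}$ be the normalized pixel locations in the region defined as the target model. The region is centered at 0. Here we use the Epanechnikov kernel with profile $k(x)$; using these weights increases the robustness of the density estimation, since the peripheral pixels are the least reliable.

The function $b: R^2 \rightarrow \{1 \ldots m\}$ associates to the pixel at location $x_i^*$ the index $b(x_i^*)$ of its bin in the quantized feature space. The probability of the feature u = 1…m in the target model is then computed as

\[
\hat{q}_u = C \sum_{i=1}^{n} k\left(\|x_i^*\|^2\right) \delta\left[b(x_i^*) - u\right]
\]

where δ is the Kronecker delta function. The normalization constant C is derived by imposing the condition $\sum_{u=1}^{m} \hat{q}_u = 1$, from where

\[
C = \frac{1}{\sum_{i=1}^{n} k\left(\|x_i^*\|^2\right)}
\]

since the summation of delta functions for u=1…m is equal to one.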
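A kernel-weighted target histogram of this kind can be sketched as follows. The input is assumed to be a 2-D array that already holds the bin index b(x) of every ROI pixel; the function name `kernel_histogram` and the way pixel coordinates are normalized to the unit circle are assumptions for this example.

```python
import numpy as np


def kernel_histogram(bin_idx, n_bins):
    """Target model: Epanechnikov-weighted histogram of an ROI of bin indices."""
    bin_idx = np.asarray(bin_idx)
    h, w = bin_idx.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ny = (ys - (h - 1) / 2.0) / (h / 2.0)       # normalized pixel locations,
    nx = (xs - (w - 1) / 2.0) / (w / 2.0)       # ROI centered at 0
    r2 = nx ** 2 + ny ** 2
    k = np.where(r2 <= 1.0, 1.0 - r2, 0.0)      # Epanechnikov profile k(||x||^2)
    # Summing k over the pixels of each bin implements the Kronecker delta.
    hist = np.bincount(bin_idx.ravel(), weights=k.ravel(), minlength=n_bins)
    return hist / k.sum()                        # C = 1 / sum_i k(||x_i||^2)


if __name__ == "__main__":
    roi_bins = np.array([[0, 0, 0],
                         [0, 1, 0],
                         [0, 0, 0]])
    q = kernel_histogram(roi_bins, n_bins=2)
    print(q, q.sum())
```

In this small example the single center pixel receives more than its unweighted share of 1/9, which shows how the kernel emphasizes the center of the ROI.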

Candidate model

Let $\{x_i\}_{i=1 \ldots n_h}$ be the normalized pixel locations of the candidate model, centered at y in the current frame. The normalization is inherited from the frame containing the target model. Here we use the same Epanechnikov kernel profile $k(x)$ as for the target model, but with bandwidth h. The probability of the feature u=1…m in the candidate model is given by

\[
\hat{p}_u(y) = C_h \sum_{i=1}^{n_h} k\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \delta\left[b(x_i) - u\right]
\]

where

\[
C_h = \frac{1}{\sum_{i=1}^{n_h} k\left(\left\|\frac{y - x_i}{h}\right\|^2\right)}
\]

is the normalization constant. Note that $C_h$ does not depend on y, because the pixel locations $x_i$ are organized in a regular lattice and y is one of the lattice nodes. The bandwidth h defines the scale of the target candidate.

Similarity measure

The similarity function defines a distance between the target model and the candidate model,

\[
d(y) = \sqrt{1 - \rho[\hat{p}(y), \hat{q}]}
\]

The Bhattacharyya coefficient, which evaluates the similarity of the target model and the candidate model, is defined as

\[
\hat{\rho}(y) \equiv \rho[\hat{p}(y), \hat{q}] = \sum_{u=1}^{m} \sqrt{\hat{p}_u(y)\, \hat{q}_u} \tag{4.16}
\]

To find the location corresponding to the target in the current frame, the Bhattacharyya coefficient in Eq. (4.16) should be maximized as a function of y, which can be solved by running the mean-shift iterations.
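The Bhattacharyya coefficient between two normalized histograms, and the distance commonly derived from it in kernel-based tracking, can be computed as in this short sketch. The function names are illustrative, and the `max(0.0, ...)` guard only protects against small negative values caused by floating-point rounding.

```python
import numpy as np


def bhattacharyya_coefficient(p, q):
    """Similarity of two normalized histograms: sum over bins of sqrt(p_u * q_u)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(np.sqrt(p * q)))


def bhattacharyya_distance(p, q):
    """Distance d = sqrt(1 - rho); 0 for identical histograms, 1 for disjoint ones."""
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q))))


if __name__ == "__main__":
    target = [0.5, 0.5, 0.0, 0.0]
    same = [0.5, 0.5, 0.0, 0.0]
    disjoint = [0.0, 0.0, 0.5, 0.5]
    print(bhattacharyya_coefficient(target, same), bhattacharyya_distance(target, same))
    print(bhattacharyya_coefficient(target, disjoint), bhattacharyya_distance(target, disjoint))
```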

Object localization

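Color and ICs information were chosen as the features; however, the same framework can be used for texture and edges, or any combination of them. In the sequel, it is assumed that the following information is available: (a) detection and localization, in the initial frame, of the objects to track; (b) periodic analysis of every object, to account for possible updates of the target models due to significant changes in color.

Minimizing the distance d(y) is equivalent to maximizing the Bhattacharyya coefficient $\hat{\rho}(y)$. The search for the new target location in the current frame starts at the location $\hat{y}_0$ of the target in the previous frame. So the probabilities $\{\hat{p}_u(\hat{y}_0)\}_{u=1 \ldots m}$ of the candidate at location $\hat{y}_0$ in the current frame have to be computed first. Using a Taylor expansion around the values $\hat{p}_u(\hat{y}_0)$, the Bhattacharyya coefficient is approximated as

\[
\rho[\hat{p}(y), \hat{q}] \approx \frac{1}{2} \sum_{u=1}^{m} \sqrt{\hat{p}_u(\hat{y}_0)\, \hat{q}_u} + \frac{C_h}{2} \sum_{i=1}^{n_h} w_i\, k\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \tag{4.17}
\]

where

\[
w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(\hat{y}_0)}}\; \delta\left[b(x_i) - u\right] \tag{4.18}
\]

In order to minimize the distance d(y), the second term in Eq. 4.17 has to be maximized, the first term being independent of y. The second term represents the density estimate computed with the kernel function at y in the current frame, with the data being weighted by Eq. 4.18. In this process, the kernel is recursively moved from the current location $\hat{y}_0$ to the new location $\hat{y}_1$ according to the relation

\[
\hat{y}_1 = \frac{\sum_{i=1}^{n_h} x_i\, w_i\, g\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n_h} w_i\, g\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)} \tag{4.19}
\]

where $g(x) = -k'(x)$.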

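Fig. 4-7 Mean-shift algorithm flow chart

Given the target model $\{\hat{q}_u\}_{u=1 \ldots m}$ and its location $\hat{y}_0$ in the previous frame, set the initial previous similarity value to 0. The mean-shift algorithm then proceeds as follows:

1. Initialize the location of the target in the current frame with $\hat{y}_0$ and compute $\{\hat{p}_u(\hat{y}_0)\}_{u=1 \ldots m}$.

2. Derive the weights $w_i$ according to Eq. 4.18.

3. Find the next location $\hat{y}_1$ of the candidate according to Eq. 4.19.

4. Compute $\{\hat{p}_u(\hat{y}_1)\}_{u=1 \ldots m}$, and evaluate $\rho[\hat{p}(\hat{y}_1), \hat{q}]$.

5. If $\rho[\hat{p}(\hat{y}_1), \hat{q}]$ is larger than the previous similarity value, set the previous similarity value to $\rho[\hat{p}(\hat{y}_1), \hat{q}]$, set $\hat{y}_0 \leftarrow \hat{y}_1$ and go to step 2. Otherwise, set $\hat{y}_0 \leftarrow \hat{y}_1$ and break.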

4.4 ROI resizing

In real situations, a human may walk toward or away from the camera, so a fixed ROI scale is not suitable: the ROI will either contain background pixels or cover only part of the human. This degrades the tracking result, as shown in Fig. 4-8, and yields a similarity value smaller than the threshold value. Therefore, the ability to resize the ROI is important. In our system the ROI scale is adjusted every 100 frames.

Fig. 4-8 ROI scale larger than object

Fig. 4-9 ROI resize flow chart

In order to adjust the ROI scale adaptively, we first use temporal differencing to find the foreground image and its position; the temporal difference is computed only within the ROI region. A dilation process is then applied to gradually enlarge the boundaries of the foreground pixels and to link the broken boundary parts obtained from the temporal difference.

The projection of the dilated image onto the x-axis and y-axis produces the current width (width_current) and height (height_current), as shown in Fig. 4-10. Finally, the new ROI size is determined with Eq. 4.20. Sometimes the dilation process is unable to link the broken boundaries, so the projection onto the x-axis and y-axis produces an ROI smaller than the actual human. We therefore set a minimum size for the ROI. After the ROI's scale is determined, the target model is updated as well. The update method uses the histogram shown in Eq. 4.21, where α is set to 0.6.
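The measurement of the current width and height can be sketched as below: temporal difference inside the ROI, dilation, and projection onto the two axes. The sketch only produces width_current and height_current clamped to a minimum size; it does not reproduce the size update of Eq. 4.20 or the histogram update of Eq. 4.21. The difference threshold, the number of dilation iterations, the minimum size and the function names are all assumed values for illustration.

```python
import numpy as np


def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 structuring element, using NumPy only."""
    out = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        acc = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                acc |= padded[dy:dy + out.shape[0], dx:dx + out.shape[1]]
        out = acc
    return out


def current_size(prev_roi, curr_roi, diff_threshold=20, min_size=(8, 8)):
    """Return (width_current, height_current) from the temporal difference in the ROI."""
    diff = np.abs(curr_roi.astype(np.int32) - prev_roi.astype(np.int32)) > diff_threshold
    fg = dilate(diff, iterations=2)              # link broken boundary parts
    cols = np.flatnonzero(fg.any(axis=0))        # projection onto the x-axis
    rows = np.flatnonzero(fg.any(axis=1))        # projection onto the y-axis
    if cols.size == 0:
        return min_size                          # no motion found: fall back to minimum
    width = int(cols[-1] - cols[0] + 1)
    height = int(rows[-1] - rows[0] + 1)
    return max(width, min_size[0]), max(height, min_size[1])


if __name__ == "__main__":
    prev = np.zeros((40, 40), dtype=np.uint8)
    curr = prev.copy()
    curr[10:20, 15:20] = 255                     # moving region: 10 rows by 5 columns
    print(current_size(prev, curr))
```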

Fig. 4-10 X-axis and Y-axis image projection

Fig. 4-11 and Fig. 4-12 show the ROI resizing with a fixed and an active camera, respectively.

(a)

(b)

Fig. 4-11 Fixed camera condition. (a) and (b): left is the difference image, right is the image after resizing.

(a)

(b)

(c)

(d)

(e)

Fig. 4-12 Active camera condition. (a) original image (c-e) left are difference images, right are the images after ROI resizing.

4.5 Kalman filter

Fig. 4-13 Kalman filter flow chart

Fig. 4-13 shows the Kalman filter algorithm, which is used to predict the target location when the target is occluded by another object or when the Bhattacharyya similarity value is smaller than a threshold value (in this case we use a threshold of 0.65).

If the Count is smaller than 5, the object is considered occluded and the predicted [...] frames, the target model histogram is updated by Eq. 4.21.

In our system, the Kalman filter is integrated into the mean-shift object tracking method. First, the Kalman filter is initialized with the mean-shift target position. Second, the search result of mean-shift is fed back as the measurement of the Kalman filter and used to estimate its parameters.

The system is modeled by the state equation and the measurement equation

\[
X_k = A X_{k-1} + W_{k-1} \tag{4.22a}
\]

\[
Z_k = H X_k + V_k \tag{4.22b}
\]

We assume that $W_k$ and $V_k$ are Gaussian random variables with zero mean, so their probability density functions are N[0,Q(t-1)] and N[0,R(t)], where the covariance matrices Q(t-1) and R(t) are referred to as the transition noise covariance matrix and the measurement noise covariance matrix. Here $X_k = [x, y, V_x, V_y]^T$ is the state of the system at moment k, and $Z_k = [x, y]^T$ is the measurement of the system state at moment k. x and $V_x$ are the horizontal position and velocity, respectively, and y and $V_y$ are the vertical position and velocity. The values of the state transition matrix A, the measurement matrix H, the process noise covariance matrix Q and the measurement noise covariance matrix R are listed in Eq. 4.23.

The detailed procedure of Kalman position prediction is listed below:

1. Predict the position of the target at moment k with the Kalman filter, and compute the prior error estimate covariance.

2. Centered at the predicted position $\hat{x}'_k$, acquire the observation value $Z_k$ according to Eq. 4.22(b).

3. Correct the prediction with the measurement: compute the Kalman gain (revision) matrix and update the posterior state estimate as well as the posterior error estimate covariance.
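The predict and correct steps can be sketched with a constant-velocity model for the state [x, y, Vx, Vy]. The concrete matrices below (unit time step, Q = q·I, R = r·I, identity initial covariance) are assumptions chosen for illustration and are not the values of Eq. 4.23; the class name is also hypothetical.

```python
import numpy as np


class ConstantVelocityKalman:
    """Kalman filter with state [x, y, Vx, Vy] and position-only measurements."""

    def __init__(self, x, y, dt=1.0, q=1e-2, r=1.0):
        self.A = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # state transition matrix
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # measurement matrix
        self.Q = q * np.eye(4)                            # assumed process noise
        self.R = r * np.eye(2)                            # assumed measurement noise
        self.x = np.array([x, y, 0.0, 0.0])               # initial state estimate
        self.P = np.eye(4)                                # initial error covariance

    def predict(self):
        """Step 1: prior state estimate and prior error covariance."""
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:2].copy()

    def correct(self, z):
        """Step 3: Kalman gain, posterior state and posterior error covariance."""
        z = np.asarray(z, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2].copy()


if __name__ == "__main__":
    kf = ConstantVelocityKalman(10.0, 5.0)
    for k in range(1, 31):
        kf.predict()
        kf.correct((10.0 + 2.0 * k, 5.0 + 1.0 * k))  # stand-in for the mean-shift result
    print(kf.predict())                               # prediction for frame 31
```

In the tracker, the measurement passed to `correct` would be the mean-shift position; during occlusion only `predict` would be called.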

Here $V'_x$ and $V'_y$ are the x and y motions, respectively. In most applications, only the motion between the current and the previous frame is considered. If the moving object keeps moving in the same direction, using two-frame motion to predict the new position gives a small error with respect to the actual position. If the moving object changes direction, two-frame motion is not good enough to represent its movement, and the predicted position becomes inaccurate.

We therefore consider the motion over more frames to gain accuracy, and choose the average motion over 5 frames as a better representation of the moving direction:

\[
V'_x = \frac{1}{f} \sum_{i=1}^{f} \left(x_i - x_{i-1}\right)
\]

\[
V'_y = \frac{1}{f} \sum_{i=1}^{f} \left(y_i - y_{i-1}\right) \tag{4.26}
\]

In both formulas, $x_i$ and $y_i$ are the horizontal and vertical coordinates of the target center, respectively, and f is the number of continuous frames. The distance between the Kalman and mean-shift positions is calculated as the Euclidean distance.
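The effect of averaging the motion over several frames can be sketched as follows. The sketch assumes that "motion over f frames" means the mean of the last f frame-to-frame displacements of the target center; this interpretation and the function name are assumptions.

```python
def average_motion(centers, frames=5):
    """Mean per-frame displacement over the last `frames` frame-to-frame steps.

    `centers` is a list of (x, y) target centers, oldest first.
    """
    if len(centers) < 2:
        return (0.0, 0.0)
    recent = centers[-(frames + 1):]          # frames steps need frames + 1 positions
    steps = len(recent) - 1
    # The sum of consecutive differences telescopes to last minus first.
    vx = (recent[-1][0] - recent[0][0]) / steps
    vy = (recent[-1][1] - recent[0][1]) / steps
    return (vx, vy)


if __name__ == "__main__":
    straight = [(0, 0), (2, 1), (4, 2), (6, 3), (8, 4), (10, 5)]
    zigzag = [(0, 0), (4, 0), (0, 0), (4, 0), (0, 0), (4, 0)]
    print(average_motion(straight, frames=1), average_motion(straight, frames=5))
    print(average_motion(zigzag, frames=1), average_motion(zigzag, frames=5))
```

For the straight path both settings give the same motion, while for the zigzag path the single-step motion overshoots and the five-step average is much smaller.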

In this section we compare the position predicted by the Kalman filter with the mean-shift position. This comparison tells us how accurate the predicted position is. We use Eq. 4.27 to calculate the error between mean-shift and the Kalman filter.

\[
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - \hat{x}_i \right| \tag{4.27}
\]

where $x_i$ is the mean-shift position and $\hat{x}_i$ is the position predicted by the Kalman filter.
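The mean absolute error between the two position sequences can be computed per coordinate as in this sketch; the function name and the sample values are illustrative only and are not measured data.

```python
def mean_absolute_error(measured, predicted):
    """MAE between mean-shift positions and Kalman predictions (one coordinate)."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


if __name__ == "__main__":
    mean_shift_x = [10, 12, 14]      # illustrative values only
    kalman_x = [11, 12, 17]
    print(mean_absolute_error(mean_shift_x, kalman_x))
```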

We run experiments in both indoor and outdoor environments. In the indoor environment, the human does not walk in a fixed direction or along a fixed path. In the outdoor environment, the human walks in a constant direction.

(a)

(b)

(c)

(d)

Fig. 4-14 Indoor environment (a) frame 643 and 662 (b) frame 679 and 710 (c) frame 718 and 728 (d) frame 761 and 797

The MAE is calculated separately in the x-direction and the y-direction. Fig. 4-15 and Fig. 4-17 show the curves of the x and y positions, where the green and blue lines indicate the mean-shift position and the Kalman filter prediction, respectively. The MAE position analysis uses the motion of the previous 2 and previous 5 frames. Table 4-1 shows that the MAE with 5-previous-frame motion is smaller than with 2-previous-frame motion. Table 4-2, in contrast, shows that the MAEs with 2 and 5 previous frames do not differ greatly. The reason is that the MAEs in Table 4-1 come from a human walking in several directions, so using more frames to compute the Kalman prediction gives a better position than using 2 previous frames. In the case of Table 4-2, the human walks in a constant direction, so 2 previous frames are enough for the Kalman prediction.

(a)

(b)

Fig. 4-15 (a) previous 2 frame motion left is x and right y position (b) previous 5 frame motion left is x and right y position

Table 4-1 MAE in different frame motion

                          X position error (pixel)   Y position error (pixel)   Distance
Previous 2 frame motion   18.25                      3.74                       18.9
Previous 5 frame motion   5.02                       2.15                       6.02

(a)

(b)

(c)

(d)

Fig. 4-16 Outdoor environment (a) frame 708 and 738 (b) frame 762 and 783 (c)frame 812 and 837 (d) frame 847 and 857

(a)

(b)

Fig. 4-17 (a) previous 2 frame motion left is x and right y position (b) previous 5 frame motion left is x and right y position

Table 4-2 MAE in different frame motion

                          X position error (pixel)   Y position error (pixel)   Distance
Previous 2 frame motion   3.98                       3.6                        5.89
Previous 5 frame motion   4.07                       3.619                      5.49

Chapter 5 Experimental results

In this chapter, we present the human detection and tracking system on an active camera. Our algorithm was implemented on a PC with an Intel Core2 Quad 2.4 GHz CPU and 2 GB of RAM, and was developed in Borland C++ Builder 6.0 on Windows XP. Because our human detection and tracking system is intended to run in real-time video surveillance with an active pan-tilt-zoom camera, we carry out experiments to test its performance and stability in several kinds of environments.

Section 5.1 introduces the experimental environment. In Section 5.2, we evaluate the Kalman position prediction, compare it with the mean-shift position, and calculate their error by the MAE (mean absolute error); the Kalman filter is used in real-time situations to solve the object occlusion problem. Section 5.3 presents experiments with a single object in the scene. Section 5.4 presents experiments with multiple objects in the scene.

5.1 Environment setup

The experiments take place in our laboratory. The environment is complex enough to verify our system while detecting and tracking a moving human. Fig. 5-1 shows several images of our laboratory environment without zoom in/out operation. Fig. 5-2 shows several images under zoom in/out conditions.

Fig. 5-1 Experimental environment

Fig. 5-2 Experiment zoom condition

5.2 Kalman filter

The following two figures (Fig. 5-3 and Fig. 5-4) compare tracking with and without the Kalman filter under occlusion. In Fig. 5-3, we can observe that when the human passes behind the red screen, the Bhattacharyya coefficient becomes smaller than the threshold, so in frame 227 the mean-shift tracking loses the object because of the occlusion. In Fig. 5-4, the Kalman filter is embedded in the mean-shift algorithm and the occlusion problem in frame 227 is solved, as shown in Table 5-1. We can also observe that the human remains locked even though the object is occluded for a long time.

Because the target object histogram is updated, the Kalman filter is not always needed in occlusion situations. Updating the histogram during occlusion increases tracking accuracy and precision.

There is one occlusion test case in the indoor environment: a man walks and is then occluded by a tall, long screen. In this situation the similarity measure drops drastically, so the Kalman filter is used to predict the position.

Table 5-1 Predicted position

Frame number   Similarity   Object center
225            0.6874       (167,153)
226            0.6366       (165,153)
227            0.5933       (176,143)
259            0.6937       (112,118)

(a)

(b)

(c)

(d)

Fig. 5-3 Human tracking using mean-shift. (a) frame 194 and 206 (b) frame 225 and 234 (c) frame 243 and 246 (d) frame 255 and 277

(a)

(b)

(c)

(d)

(e)

Fig. 5-4 Human tracking using mean-shift and the Kalman filter. (a) Frame 186 and 210 (b) Frame 224 and 227 (c) Frame 229 and 233 (d) Frame 255 and 263 (e) Frame 269 and 278

5.3 Active camera with single object experiment

In this section we experiment with a single object moving in the scene, which the active camera should track smoothly. We discuss 5 topics in which a single object in various situations leads to different performance.

5.3.1 9 regions and 25 regions experiments

There are many methods to control the direction of an active camera, for example motion-based or position-based methods. In this thesis we use a position-based method to control the camera pan/tilt directions. The image size is 320x240, and we divide it into 9 or 25 regions. Every region implies a direction and a speed, so that when the object is in one of these regions, the algorithm sends a command to the active camera telling it to pan/tilt.
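The position-based mapping from the object center to a region can be sketched as below for a 3x3 (9 regions) or 5x5 (25 regions) grid on the 320x240 image. The actual direction and speed assigned to each region are defined in the thesis figures and are not reproduced here; the sketch only returns the grid cell and its signed offset from the central cell, and the sign convention and function names are assumptions.

```python
def region_of(x, y, grid=3, width=320, height=240):
    """Return the (row, col) grid cell that contains the pixel (x, y)."""
    col = min(int(x * grid / width), grid - 1)
    row = min(int(y * grid / height), grid - 1)
    return row, col


def pan_tilt_offset(x, y, grid=3, width=320, height=240):
    """Signed (pan, tilt) offset of the object's cell from the central cell.

    (0, 0) means the object is in the central region and the camera can stay still;
    larger magnitudes (possible with grid=5) could be mapped to higher speeds.
    """
    row, col = region_of(x, y, grid, width, height)
    centre = grid // 2
    return col - centre, row - centre


if __name__ == "__main__":
    print(pan_tilt_offset(160, 120))            # object at the image center
    print(pan_tilt_offset(10, 10))              # object near the top-left corner
    print(pan_tilt_offset(319, 239, grid=5))    # bottom-right corner, 25 regions
```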

With 9 regions, each region has a fixed speed and the motion is ended by a stop command, which means the pan/tilt angle is limited by the stop command. Visually, the camera repeatedly stops and goes, as shown in Fig. 5-5. This stop-and-go phenomenon is uncomfortable for the observer. The directions of the 9 regions are shown in Fig. 5-6. In real situations the object will be missed

In order to solve the stop-and-go phenomenon, the image is divided into 25 regions, as shown in Fig. 2-5 of Chapter 2. Each region has a different speed and the stop
