Chapter 1 Introduction
1.2 Related Work
Fig. 1-1 Three categories of tracking
In recent years, many human detection systems have been developed. A human detection system determines the position and size of the object in the image. A good human detection system can also provide valuable insight into how one might approach other, similar feature and pattern detection systems. A human detection system has two parts: segmentation of moving objects from the background, and discrimination of humans from nonhuman objects. A moving object that appears in the image is first extracted from the background image. Many methods have been used to detect the humans present in an image.
Optical flow can be used to estimate independently moving objects, but it requires complex computation and is sensitive to changes in intensity. Optical flow is used in [2,6] to detect vehicles. Zhao et al. [8] exploited a stereo-based segmentation algorithm to extract objects from the background and recognized the objects with a neural network. Although stereo-vision-based techniques have proved to be more robust, they require at least two cameras and can be used only for short- and middle-distance detection. Dalal and Triggs [7] use a gray-scale image to obtain an edge image, from which they extract oriented gradients; they select the dominant oriented gradients to detect humans. Sebastian and Alvaro [10] present a computer vision algorithm designed to operate with moving cameras and to detect humans in different poses under partial or complete view of the human body. Boosted detector cascades were introduced by Viola et al. [13]. In this approach, AdaBoost is used in each layer to iteratively construct a strong classifier guided by user-specified performance criteria. The cascade approach achieves high processing speed because usually only a few feature evaluations in the early cascade layers are needed to reject non-pedestrian examples. The work in [14] presents a template-based approach to detecting human silhouettes in a specific walking pose; the templates consist of short sequences of 2D silhouettes obtained from motion capture data.
A human tracking system follows the target human through a sequence of images in terms of changes in scale and position. As mentioned in the introduction, there are two types of real-time tracking system: one tracks the object with a fixed camera and the other with an active camera. An active camera can keep the object in the camera's scene by driving pan and tilt. An active camera also has a zoom function that can be used to zoom in or out when the object's image resolution is too small to track. An active camera used in a surveillance system needs to achieve real-time tracking; if a tracking algorithm takes too much computation time, the active camera cannot follow the object immediately. Many tracking systems that use an active camera work only with pan-tilt or only with zoom. The objective of our approach is to correctly select the target human and drive pan/tilt to keep the target in the center of the field of view; when the target's size falls below the minimum size or exceeds the maximum size, the zoom function zooms in or out. Active cameras have many applications. Murray et al. [30] utilized morphological filtering of motion images for background compensation. This motion tracking method can track a moving object in dynamic images with pan/tilt angles. C. Lin et al. [31] use an image mosaic technique to track moving objects indoors with a single pan/tilt camera.
Collins et al. [32] developed a system with multiple cameras that tracked a moving figure using pan/tilt cameras alone. This system used a kernel-based tracking approach to overcome the apparent motion of the background as the camera moved. L. Fiore et al. [33] use a wide-angle camera and an active camera to achieve human tracking. In this work the target object is found by the wide-angle camera, and a camera calibration method gives the active camera the pan/tilt angle needed to track the object.
Fig. 1-1 shows three kinds of tracking method. Feature-based tracking is the most common. Color, edge, or motion is commonly used as the tracking feature. Edge detection methods, such as the Sobel method [3], the Laplacian method [3], and the Marr–Hildreth method [4], convolve masks with the image to detect edges from abrupt changes in gray level. Wei Guo et al. [1] proposed a human tracking system based on shape analysis. Law et al. [5] designed fuzzy rules for edge-based human tracking, although this method requires a rather large and complicated rule set. However, these methods need more computation time, and edge pixels cannot always be detected continuously. We note that all the methods mentioned above detect edges in gray-level images, and they are unsuitable for color images because a pixel is then represented not by a single gray level but by a vector in a color space. An edge between adjacent pixels that have the same value in any one color component may not be detected. Edge detection on the gray-level image alone is therefore neither sufficient nor robust. Pattern recognition can also achieve human tracking by learning the target object and searching for it in the image sequence. Williams et al. [26] extended the approach to nonlinear translation predictors learned by a Relevance Vector Machine (RVM). Agarwal and Triggs [27] used an RVM to learn linear and nonlinear mappings for tracking 3D human poses from silhouettes. Bohyung Han and Larry Davis [28] use PCA to extract features from color and use these features in the mean-shift algorithm to implement object tracking. Robert T. et al. [29] present an online feature selection mechanism that evaluates multiple features while tracking and adjusts the set of features used to improve tracking performance. Their feature evaluation mechanism is embedded in a mean-shift tracking system that adaptively selects features for tracking. The mean-shift algorithm was originally proposed by Fukunaga and Hostetler [9] for clustering data. Kernel-based object tracking was proposed by Meer et al. [19]. This method tracks an object region represented by a spatially weighted intensity histogram. The similarity between the target and a candidate object is measured by the Bhattacharyya coefficient, and tracking is achieved by optimizing this objective function with the iterative mean-shift algorithm. Later, many variants of the mean-shift algorithm were proposed for various applications [20-23].
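As a minimal sketch of the similarity measure mentioned above, the Bhattacharyya coefficient of two normalized histograms p and q is the sum over bins of sqrt(p_u q_u). The function names and the toy histograms below are illustrative assumptions and are not taken from [19].

```python
import math


def normalize(hist):
    """Scale a histogram of non-negative counts so that its bins sum to 1."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")
    return [h / total for h in hist]


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms (1.0 = identical)."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Toy 4-bin histograms (hypothetical data): a target model and two candidates.
target = normalize([4, 3, 2, 1])
same_shape = normalize([8, 6, 4, 2])   # same distribution, more pixels
different = normalize([0, 0, 1, 9])    # mass concentrated in other bins

score_same = bhattacharyya(target, same_shape)
score_diff = bhattacharyya(target, different)
```

A tracker of this kind would prefer the candidate region with the larger coefficient; here `score_same` is larger than `score_diff`.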
Although the mean-shift object tracking algorithm performs well on sequences with relatively small object displacement, its performance is not guaranteed when objects undergo partial or full occlusion. To improve the performance of the mean-shift tracker when the object is partially occluded, methods based on the Kalman filter [34,35] and the particle filter [24,25] have been proposed. K. Nummiaro et al. [25] use the idea of the particle filter to apply a recursive Bayesian filter based on sample sets, with color as the feature. Their work evolved from the condensation algorithm, which was developed in the computer vision community.
Plamen P. et al. [15] use the mean-shift method for face detection and automated control of an active camera to follow a person’s face and keep his image centered in
Chapter 2
System Overview
2.1 Active camera control
Fig. 2-1 Active camera control through RS-485
The proposed system uses an active camera to track a human and keep the human in the center of the monitor screen, that is, the camera's field of view (FOV). The active camera is controlled with the Pelco P-protocol [16] through an RS-232 to RS-485 converter. To achieve tracking, we control the pan angle (left and right), the tilt angle (up and down), and the zoom step.
A Pelco P-protocol message has 8 bytes, with the format shown in Fig. 2-2. Byte 1 and byte 7 are the start and stop bytes and are always set to 0xA0 and 0xAF, respectively. Byte 2 is the receiver (camera) address; since we use only one camera, byte 2 is always set to 0x01. Bytes 3 to 6 control pan-tilt-zoom (PTZ), as shown in Fig. 2-3. The last byte is an XOR checksum byte.
            Bit 7       Bit 6      Bit 5      Bit 4      Bit 3       Bit 2      Bit 1       Bit 0
Data byte1                                    On/Off     Iris close  Iris open  Focus near  Focus far
Data byte2  Fixed to 0  Zoom wide  Zoom tele  Tilt down  Tilt up     Pan left   Pan right   0 (for pan/tilt)
Data byte3  Pan speed: $00 (stop) to $3F (high speed), and $40 for turbo
Data byte4  Tilt speed: $00 (stop) to $3F (high speed)
Fig. 2-3 Data byte 1 to 4 format
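The 8-byte message described above can be sketched as follows. This is an illustration under stated assumptions, not the control code of the thesis: the checksum is assumed to be the XOR of bytes 1 to 7, the bit masks assume that the entries of data byte 2 in Fig. 2-3 are listed from bit 7 down to bit 0, and all names are hypothetical.

```python
STX = 0xA0  # byte 1, start byte
ETX = 0xAF  # byte 7, stop byte

# Data byte 2 masks, ASSUMING Fig. 2-3 lists the entries from bit 7 down to bit 0.
ZOOM_WIDE = 0x40
ZOOM_TELE = 0x20
TILT_DOWN = 0x10
TILT_UP = 0x08
PAN_LEFT = 0x04
PAN_RIGHT = 0x02


def build_frame(data1=0x00, data2=0x00, pan_speed=0x00, tilt_speed=0x00, address=0x01):
    """Assemble one 8-byte frame: start, address, 4 data bytes, stop, checksum."""
    if not 0x00 <= pan_speed <= 0x40:
        raise ValueError("pan speed must be $00..$3F, or $40 for turbo")
    if not 0x00 <= tilt_speed <= 0x3F:
        raise ValueError("tilt speed must be $00..$3F")
    body = [STX, address, data1, data2, pan_speed, tilt_speed, ETX]
    checksum = 0
    for value in body:
        checksum ^= value  # ASSUMPTION: checksum is the XOR of bytes 1 to 7
    return bytes(body + [checksum])


# Example: pan right at medium speed, camera address 0x01 as used in the text.
frame = build_frame(data2=PAN_RIGHT, pan_speed=0x20)
```

With this checksum convention, the XOR of all eight bytes of a valid frame is zero, which gives a receiver a simple validity check.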
In this thesis, we divide the image view into 25 regions to drive the PTZ and keep the moving object in the center of the FOV. Each region has a specific direction, angle, and speed, as shown in Fig. 2-5. If the target is located in the stop region, the camera is set to stop. Otherwise, the camera is driven according to the region in which the target is located; for example, in region H the camera is driven upward, and so on. Zoom-in and zoom-out are activated when the target's size becomes smaller or larger than the user-defined size. For example, if the target's size is more than 1.5 times the user-defined size, zoom-out is activated; if the target's size is less than 0.5 times the user-defined size, zoom-in is activated. While the active camera is moving, the tracking system keeps running alongside the camera control system, so if the moving object changes direction, the control action can be changed immediately.
A B C D E
F G H I J
K L Stop M N
O P Q R S
T U V W X
Fig. 2-4 Screen divided into 25 regions
Fig. 2-5 Control direction for each region
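The region lookup and the zoom rule can be sketched as below. The sketch assumes 25 equal-size cells on a 320x240 frame, image y increasing downward, and a camera that pans or tilts toward the side on which the target lies; the actual directions, angles, and speeds are those of Fig. 2-5 and are not reproduced here. Function names are hypothetical.

```python
GRID = [
    ["A", "B", "C", "D", "E"],
    ["F", "G", "H", "I", "J"],
    ["K", "L", "Stop", "M", "N"],
    ["O", "P", "Q", "R", "S"],
    ["T", "U", "V", "W", "X"],
]


def pan_tilt_command(x, y, width=320, height=240):
    """Map the target center (integer pixels) to (region, pan, tilt)."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("target center lies outside the frame")
    col = x * 5 // width    # ASSUMPTION: five equal-width columns
    row = y * 5 // height   # ASSUMPTION: five equal-height rows
    pan = "left" if col < 2 else "right" if col > 2 else None
    tilt = "up" if row < 2 else "down" if row > 2 else None
    return GRID[row][col], pan, tilt


def zoom_command(target_size, user_size):
    """Apply the 1.5x / 0.5x rule from the text to the target size."""
    if target_size > 1.5 * user_size:
        return "zoom-out"
    if target_size < 0.5 * user_size:
        return "zoom-in"
    return None


center = pan_tilt_command(160, 120)   # target in the middle of the frame
above = pan_tilt_command(160, 60)     # target above the center
```

Under these assumptions, a target above the center falls in region H and yields an upward tilt, which agrees with the example given in the text.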
2.2 Software system
The whole system consists of three major parts: human detection, human tracking, and PTZ control. Input frames are captured by the active camera at a resolution of 320x240, and moving objects are extracted by background difference. We use the previous 20 frames to build the background image. Independent component analysis (ICA) and a support vector machine (SVM) classifier are applied to classify each moving object as human or non-human.
Once a human is detected, the system needs to focus on that target. The mean-shift algorithm tracks the target by computing the similarity value in every frame and sends the position to the active camera, which then drives the PTZ to keep the target in the center of the FOV. Sometimes the human is partially or fully occluded by another object; the similarity then decreases drastically, and mean-shift consequently loses track of the target. In this case, the Kalman filter is activated to predict the target's position in the next frame.
Fig. 2-2 System overview: image source part, human detection part, human tracking part (occlusion check), and PTZ control part (pan and tilt control)
Chapter 3
Object extraction and Human detection
Fig. 3-1 Human detection system
Chapter 3 describes moving object extraction, including preprocessing, and the classification of moving objects into human and nonhuman. The architecture of moving object extraction is indicated by the dotted-line block in Fig. 3-1, and the remaining blocks represent our human feature extraction and classification.
3.1 Moving object extraction
In most surveillance systems, the default position of the camera is fixed even if the camera is an active camera, so we can use a still image as the background image. Background subtraction is the simplest way to extract a moving object from an image, and we use it in order to meet the real-time requirement.
Our background image is constructed from the first 20 frames. The difference at each pixel (x, y) is calculated by
$$I_{BS}(x, y) = |I_C(x, y) - I_B(x, y)| \qquad (3.1)$$
where $I_C$ and $I_B$ denote the current and background gray images, respectively, and $I_{BS}$ denotes the background subtraction image. A threshold value $th_s$ is chosen to produce the binary moving object $M_{obj}$, as described in the following equation.
$$M_{obj}(x, y) = \begin{cases} 1, & I_{BS}(x, y) > th_s \\ 0, & \text{otherwise} \end{cases} \qquad (3.2)$$
The dilation process is applied on $M_{obj}$ to gradually enlarge the boundaries of the moving object pixels (foreground). Thus areas of foreground pixels grow in size while holes within those regions become smaller.
$$I_D(x, y) = \begin{cases} 1, & T \geq 1 \\ 0, & \text{otherwise} \end{cases}, \qquad T = \sum_{i=-1}^{1} \sum_{j=-1}^{1} M_{obj}(x + i, y + j) \qquad (3.3)$$
T in Eq. 3.3 denotes the total number of foreground pixels under a (3x3) dilation mask. If at least one pixel under the mask coincides with a foreground pixel, the center of the (3x3) mask is set to the foreground value. If all the corresponding pixels are background, the center is set to the background value.
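The steps of this section (background construction, subtraction, thresholding with ths, and 3x3 dilation) can be sketched in plain Python on small nested lists. How the 20 frames are combined is not stated in the text, so the pixel-wise mean used here is an assumption, as are the strict comparison against the threshold, the function names, and the toy data.

```python
def build_background(frames):
    """Pixel-wise mean of gray frames (ASSUMPTION: the text does not state the rule)."""
    n = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(w)] for y in range(h)]


def subtract_and_threshold(current, background, th_s):
    """Binary mask: 1 where |I_C - I_B| exceeds the threshold, else 0."""
    return [
        [1 if abs(c - b) > th_s else 0 for c, b in zip(c_row, b_row)]
        for c_row, b_row in zip(current, background)
    ]


def dilate3x3(mask):
    """Set a pixel to 1 if the 3x3 window around it holds at least one foreground pixel."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            t = 0  # number of foreground pixels under the mask (T in the text)
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        t += mask[yy][xx]
            out[y][x] = 1 if t >= 1 else 0
    return out


# Toy data: 20 flat 5x5 frames, then a current frame with one bright pixel.
frames = [[[10] * 5 for _ in range(5)] for _ in range(20)]
background = build_background(frames)
current = [[10] * 5 for _ in range(5)]
current[2][2] = 200
mask = subtract_and_threshold(current, background, th_s=30)
dilated = dilate3x3(mask)
```

In this toy case the single foreground pixel grows into a 3x3 block after dilation, which is the boundary enlargement described in the text.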
Connectivity between pixels is a fundamental concept that simplifies the definition of numerous digital image concepts, such as regions and boundaries. To establish whether two pixels are connected, it must be determined whether they are neighbors and whether their gray levels satisfy a specified similarity criterion. For instance, in a binary image with values 0 and 1, two pixels may be 4-neighbors, but they are said to be connected only if they have the same value. Connected component labeling works by scanning an image pixel by pixel to identify connected pixel regions and assigning the same label to pixels that are connected together. Once all groups have been determined, each pixel is labeled with a gray level or a color according to the component it was assigned to, as shown in Fig. 3-2. After connected component labeling, we examine the existing labels. In practice, not all of them belong to moving object pixels, so we use a (9x9) filter to eliminate noise labels. The connected components without noise can be clearly observed in Fig. 3-2. The results of the whole moving object extraction process are shown in Fig. 3-3.
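Connected component labeling with 4-connectivity can be sketched with a breadth-first flood fill, as below. The noise removal shown here drops components with fewer than a chosen number of pixels; this is a simple stand-in and is not claimed to be the (9x9) filter used in the thesis. All names and the toy mask are hypothetical.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected foreground regions; return (label image, {label: pixel count})."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] != 1 or labels[y][x] != 0:
                continue
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])
            count = 0
            while queue:
                cy, cx = queue.popleft()
                count += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            sizes[next_label] = count
    return labels, sizes


def drop_small(labels, sizes, min_pixels):
    """Clear labels whose component is smaller than min_pixels (stand-in noise filter)."""
    keep = {label for label, n in sizes.items() if n >= min_pixels}
    return [[label if label in keep else 0 for label in row] for row in labels]


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
labels, sizes = label_components(mask)
cleaned = drop_small(labels, sizes, min_pixels=2)
```

In the toy mask, the 2x2 block survives while the two isolated pixels are removed as noise.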
Fig. 3-2 Moving object extraction (stages: IB, IC, Mobj, ID, ICC, ISF, IROI)
Fig. 3-3 Moving object extraction process. (a) Left is IB; right is IC. (b) Left is Mobj; right is ICC. (c) Left is ISF; right is IROI.
3.2 Human detection
We apply independent component analysis (ICA) to the foreground regions for feature extraction. The regions are then classified as human or nonhuman by an SVM classifier, as shown in Fig. 3-4.
Fig. 3-4 Human detection flow chart
3.2.1 ICA feature extraction
In recent years, ICA [38] has been applied to human feature extraction to construct a sufficient set of features describing human beings. ICA is a statistical method for transforming an observed multidimensional random vector into components that are statistically independent. ICA is a generalization of principal component analysis (PCA) and is a higher-order statistical approach. As Fig. 3-5 shows, ICA transforms each input image into a combination of bases and their corresponding coefficients.
Fig. 3-5 Image decomposition: input image = u_1 × basis 1 + u_2 × basis 2 + ... + u_n × basis n
Suppose we have $m$ training images, including both human and non-human images, each of size ($n_r \times n_c$). Each image is reshaped into an $N$-vector, so the mixture data $X$ is an ($m \times N$) matrix and $H$ is an ($m \times n$) mixture matrix. Since all vectors are column vectors and the transpose of a column vector is a row vector, we can rewrite Eq. 3.5 as Eq. 3.6 by using vector-matrix notation
$$X = HS \qquad (3.6)$$
Without loss of generality, we assume that both the mixture variables and the independent components have zero mean and non-Gaussian distributions. For nonzero-mean distributions, the observable variables can always be centered by subtracting the sample mean to obtain a zero-mean distribution. If $W$ denotes the inverse of the basis matrix $S$, the coefficient matrix $U$ for the training matrix $X$ is expressed as
$$U = W X^T \qquad (3.7)$$
The $n$-component base vectors that best distinguish humans from nonhumans should be chosen from the many candidate components.
3.2.2 Entropy feature selection
Unlike PCA features, ICA features are not sorted, so conditional entropy is applied to feature selection, that is, to sorting the ICA features and choosing an appropriate subset of them. Sorting variables can be an important step in enhancing a high-dimensional dataset, which gave us the idea of placing correlated or similar dimensions close to each other in the high-dimensional value space to help human users.
If entropy is the amount of information provided by a random variable, then conditional entropy can be defined as the amount of information about one random variable provided by another random variable. The entropy of a random variable reflects how much information the observed variable carries: the more random, and hence unpredictable, the variable is, the larger its entropy value.
The 2-D data space obtained from ICA feature extraction needs to be discretized into a matrix of grid cells by separating each dimension into a set of intervals, or bins. The discretization process begins by calculating the mean value of the data in one dimension and dividing the data into two halves at that mean value. Recursively, each half is divided into halves at its own mean value. The recursion stops when we obtain the required number of intervals or meet the constraint on the total number of bins. Let $Z$ be a discrete random variable with possible values $\{z_1, z_2, \ldots, z_m\}$. The information entropy of $Z$ with probability density $p(z)$ is defined in
$$H(Z) = -\sum_{i=1}^{m} p(z_i) \log p(z_i) \qquad (3.8)$$
The conditional entropy quantifies the uncertainty of a random variable $Y$ given that the value of a second random variable $Z$ is known. Each coefficient has to be normalized to [-1, 1] and quantized to $n$ bins. Let $Y = \{-1, 1\}$ be the desired class; then the conditional entropy can be described in
$$H(Y \mid Z) = H(Y, Z) - H(Z) = \sum_{z} p(z)\, H(Y \mid Z = z) \qquad (3.9)$$
The conditional entropy $H(Y \mid Z)$ is a weighted sum of the entropy values in all columns, where the joint entropy is defined by
$$H(Y, Z) = -\sum_{z} \sum_{y \in Y} p(y, z) \log p(y, z) \qquad (3.10)$$
We sort the conditional entropy values $H(Y \mid Z)$ and use the sorted results to select the corresponding independent components. The coefficients, or independent components, with better classification ability are associated with smaller conditional entropy.
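The recursive mean-split discretization and the conditional-entropy ranking can be sketched as follows. The sketch assumes base-2 logarithms, a fixed recursion depth as the stopping rule, and coefficients already normalized to [-1, 1]; the function names and the toy features are hypothetical and do not come from the thesis data.

```python
import math
from bisect import bisect_right
from collections import Counter


def mean_split_edges(values, depth):
    """Cut points from recursively splitting the data at its mean, `depth` levels deep."""
    if depth == 0 or len(values) < 2:
        return []
    mean = sum(values) / len(values)
    low = [v for v in values if v < mean]
    high = [v for v in values if v >= mean]
    if not low or not high:
        return []  # all values equal: nothing left to split
    return mean_split_edges(low, depth - 1) + [mean] + mean_split_edges(high, depth - 1)


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(values, labels, depth=1):
    """H(Y | Z): entropy of the labels inside each bin of Z, weighted by bin probability."""
    edges = mean_split_edges(values, depth)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(bisect_right(edges, v), []).append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())


# Toy data: class labels and two candidate coefficients per sample.
y = [-1, -1, -1, 1, 1, 1]
feature_a = [-0.9, -0.8, -0.7, 0.7, 0.8, 0.9]   # separates the classes
feature_b = [-0.9, 0.8, -0.7, 0.7, -0.8, 0.9]   # mixes the classes
scores = [conditional_entropy(f, y) for f in (feature_a, feature_b)]
ranking = sorted(range(len(scores)), key=lambda k: scores[k])  # smallest entropy first
```

The feature with the smaller conditional entropy is ranked first, in line with the selection rule stated above.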
The selected ICA features are used in the SVM classifier to distinguish humans from nonhumans. The training database consists of 1843 human and 840 nonhuman images, each of size (40x40) pixels. The maximum number of ICs is 76, and we choose the 30 best bases for the SVM classifier.
3.2.3 SVM classifier
Support vector machines (SVMs) were developed to solve classification and regression problems. The SVM starts from a linearly separable problem. For classification, the goal of the SVM is to separate two classes by a function induced from the available examples. Consider the example in Fig. 3-6: there are two classes of data and many possible linear classifiers that can separate them, but only one maximizes the margin, that is, the distance between the two classes. This linear classifier is called the optimal separating hyperplane. Fig. 3-7 shows the accuracy rate and the number of support vectors (SVs) for each number of ICs up to 76. According to this experimental result, the accuracy rate of the human detection system increases as the number of ICs increases. In practice, however, we select 30 ICs because the number of SVs is then close to its minimum and the accuracy rate is above 90%. The main reason for minimizing the number of SVs is to decrease the computation cost in order to meet the real-time requirement.
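For reference, the optimal separating hyperplane in the linearly separable case can be written in the standard textbook form below. The notation (w, b, x_i, y_i, alpha_i, K, l) is assumed here and is not taken from the thesis.

```latex
% Standard hard-margin formulation; all symbols are assumed notation.
\begin{aligned}
&\text{Hyperplane: } w^{T}x + b = 0, \qquad y_i \in \{-1, +1\} \\
&\text{Margin between the two classes: } \frac{2}{\lVert w \rVert} \\
&\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
  \quad \text{subject to} \quad y_i\,(w^{T}x_i + b) \ge 1, \; i = 1, \dots, l \\
&f(x) = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i\, y_i\, K(x_i, x) + b\Big)
\end{aligned}
```

The decision function sums over the support vectors only, so the cost of classifying one sample grows with the number of SVs, which is consistent with the preference for few SVs stated above.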
Fig. 3-6 Optimal separating hyperplane
Fig. 3-7 Analysis of feature selection. (a) Accuracy rate. (b) Number of SV.
Chapter 4
Fig. 4-1 Human tracking flow chart
A human tracking system based on the mean-shift algorithm is proposed in this thesis. The mean-shift algorithm is a simple iterative procedure that shifts each data point to the average of the data points in its neighborhood, thereby locating the maxima of a density function given discrete data sampled from that function [36]. Generally, the mean-shift algorithm uses a color feature, but in this thesis we combine the color feature with the independent component (IC) feature. The idea behind the combination comes from the IC-based human detection in Chapter 3. Although the color feature alone can track a moving object, the combination of color and ICs can track not just any moving object but specifically a moving human.
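The iterative shifting step can be illustrated with a one-dimensional sketch that uses a flat kernel: the estimate is repeatedly replaced by the mean of the samples within a fixed bandwidth until it stops moving. This only illustrates the mode-seeking idea; the tracker in this thesis works on color and IC features, which are not modeled here, and the names, bandwidth, and samples are hypothetical.

```python
def mean_shift_mode(samples, start, bandwidth, max_iter=100, tol=1e-6):
    """Move `start` uphill to a local density maximum using a flat kernel."""
    x = start
    for _ in range(max_iter):
        window = [s for s in samples if abs(s - x) <= bandwidth]
        if not window:
            return x  # no samples nearby: nothing to shift toward
        new_x = sum(window) / len(window)  # mean of the neighborhood
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


# Two clusters of 1-D samples, around 1.0 and around 5.0.
samples = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.1, 4.9]
mode_low = mean_shift_mode(samples, start=2.0, bandwidth=1.5)
mode_high = mean_shift_mode(samples, start=4.0, bandwidth=1.5)
```

Starting points near different clusters converge to different modes, which is why a tracker initialized close to the target's previous position tends to stay on that target.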
The Kalman filter is applied to predict the location of the moving human in the next frame. The Kalman filter is activated when the moving human is partially occluded by other