Chapter 2 System Overview
2.2 Software system
The whole system consists of three major parts: human detection, human tracking, and PTZ control. Input frames are captured by an active camera at a resolution of 320x240, and moving objects are extracted by background differencing. The previous 20 frames are used to build the background image. Independent component analysis (ICA) and a support vector machine (SVM) classifier are then applied to classify each moving object as human or non-human.
When a human is detected, the system must keep it in focus. The mean-shift algorithm tracks the target by computing the similarity value in every frame and sends the position to the active camera, which drives the PTZ unit to keep the target in the center of the field of view (FOV). Sometimes the human is partially or fully occluded by another object; the similarity then decreases drastically and mean-shift loses track of the target. In this case, the Kalman filter is activated to predict the target's position in the next frame.
Fig. 2-2 System overview (image source part, human detection part, human tracking part, PTZ control part)
Chapter 3 Object Extraction and Human Detection
Fig. 3-1 Human detection system
Chapter 3 describes moving object extraction, including preprocessing, and the classification of moving objects into human and non-human. The architecture of moving object extraction is indicated by the dotted-line block in Fig. 3-1; the remaining blocks represent human feature extraction and classification.
3.1 Moving object extraction
In most surveillance systems the default position of the camera is fixed, even when the camera is an active camera, so a still image can be used as the background image.
Background subtraction is the simplest way to extract moving objects from an image, and we use it in order to meet the real-time requirement.
Our background image is constructed from the first 20 frames. The difference for each pixel (x, y) is calculated by
$$I_{BS}(x, y) = |I_C(x, y) - I_B(x, y)| \tag{3.1}$$
where $I_C$ and $I_B$ denote the current and background gray images, respectively, and $I_{BS}$ denotes the background subtraction image. A threshold value $th_s$ is chosen to produce the binary moving object mask $M_{obj}$, as described in the following equation.
$$M_{obj}(x, y) = \begin{cases} 1, & I_{BS}(x, y) > th_s \\ 0, & \text{otherwise} \end{cases} \tag{3.2}$$
A dilation process is applied to $M_{obj}$ to gradually enlarge the boundaries of the moving object pixels (foreground). The areas of foreground pixels thus grow in size, while holes within those regions become smaller.
$$I_D(x, y) = \begin{cases} 1, & T \geq 1 \\ 0, & \text{otherwise} \end{cases} \qquad T = \sum_{i=-1}^{1} \sum_{j=-1}^{1} M_{obj}(x + i, y + j) \tag{3.3}$$
T in Eq. 3.3 denotes the total number of foreground pixels inside a (3x3) dilation mask. If at least one pixel under the mask is a foreground pixel, the center of the (3x3) mask is set to the foreground value. If all the corresponding pixels are background, it is set to the background value.
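The thresholding and dilation steps of Eq. 3.1 to Eq. 3.3 can be sketched as follows. This is a minimal illustration on nested lists rather than the implementation used in this work; the function names, the strict comparison against the threshold, and the treatment of pixels outside the image as background are assumptions.

```python
def background_subtract(current, background, threshold):
    """Binary mask: 1 where |I_C - I_B| exceeds the threshold (assumed strict)."""
    return [
        [1 if abs(c - b) > threshold else 0 for c, b in zip(row_c, row_b)]
        for row_c, row_b in zip(current, background)
    ]


def dilate3x3(mask):
    """3x3 dilation: a pixel becomes foreground if its window holds any foreground."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            total = 0  # T: number of foreground pixels inside the 3x3 window
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    # Pixels outside the image are treated as background (assumption).
                    if 0 <= yy < rows and 0 <= xx < cols:
                        total += mask[yy][xx]
            out[y][x] = 1 if total >= 1 else 0
    return out


# Toy 5x5 frame: a flat background with one bright "moving" pixel in the middle.
background = [[10] * 5 for _ in range(5)]
current = [row[:] for row in background]
current[2][2] = 200
mask = background_subtract(current, background, threshold=30)
dilated = dilate3x3(mask)
```

In this toy frame the single foreground pixel grows into a 3x3 block after dilation, which is the hole-filling and boundary-enlarging effect described above.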
Connectivity between pixels is a fundamental concept that simplifies the definition of numerous digital image concepts, such as regions and boundaries.
To establish whether two pixels are connected, it must be determined whether they are neighbors and whether their gray levels satisfy a specified similarity criterion. For instance, in a binary image with values 0 and 1, two pixels may be 4-neighbors, but they are said to be connected only if they have the same value. Connected component labeling works by scanning an image pixel by pixel in order to identify connected pixel regions and assigning the same label to pixels that are connected together. Once all groups have been determined, each pixel is labeled with a gray level or a color according to the component it was assigned to, as shown in Fig. 3-2. After connected component labeling, we examine the existing labels. In practice, not all of them belong to moving objects, so we use a (9x9) filter to eliminate noise labels. The connected components without noise can be clearly observed in Fig. 3-2. All of the moving object extraction results are shown in Fig. 3-3.
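Connected component labeling followed by noise removal can be sketched as below. The sketch assumes 4-connectivity and removes components by pixel count; the `min_pixels` parameter is a hypothetical stand-in for the (9x9) filter, whose exact criterion is not specified in the text.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected foreground regions; returns (labels, component count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for y in range(rows):
        for x in range(cols):
            if mask[y][x] != 1 or labels[y][x] != 0:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def drop_small_components(labels, count, min_pixels):
    """Zero out labels with fewer than min_pixels pixels (assumed noise criterion)."""
    sizes = [0] * (count + 1)
    for row in labels:
        for value in row:
            sizes[value] += 1
    return [
        [v if v and sizes[v] >= min_pixels else 0 for v in row]
        for row in labels
    ]


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
labels, count = label_components(mask)
cleaned = drop_small_components(labels, count, min_pixels=3)
```

Here the 2x2 block survives as one component, while the two isolated pixels are discarded as noise.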
Fig. 3-2 Moving object extraction (stages: IB, IC, Mobj, ID, ICC, ISF, IROI)
Fig. 3-3 Moving object extraction process. (a) Left: IB; right: IC. (b) Left: Mobj; right: ICC. (c) Left: ISF; right: IROI.
3.2 Human detection
We apply independent component analysis (ICA) to the foreground regions for feature extraction, and then classify each region as human or non-human with an SVM classifier, as shown in Fig. 3-4.
Fig. 3-4 Human detection flow chart
3.2.1 ICA feature extraction
In recent years, ICA [38] has been applied to human feature extraction in order to construct a sufficient set of features describing human beings. ICA is a statistical method for transforming an observed multidimensional random vector into components that are statistically independent. ICA is a generalization of principal component analysis (PCA) and is a high-order statistical approach. As Fig. 3-5 shows, ICA transforms each input image into a combination of bases and their corresponding coefficients.
Fig. 3-5 Image decomposition (the input image as a weighted sum of basis images with coefficients u1, u2, ..., un)
Suppose we have m training images, including both human and non-human images, each of size (r × c). Each image is reshaped into an N-vector (N = r × c), so the mixture data X is an (m x N) matrix. Since all vectors are column vectors and the transpose of a column vector is a row vector, we can rewrite Eq. 3.5 as Eq. 3.6 using vector-matrix notation
$$X = HS \tag{3.6}$$
where H is the mixture matrix and S is the basis matrix.
Without loss of generality, we assume that both the mixture variables and the independent components have zero mean and non-Gaussian distributions. For nonzero-mean distributions, the observable variables X can always be centered by subtracting the sample mean to obtain a zero-mean distribution. If W denotes the inverse of the basis matrix S, the coefficient matrix U for the training matrix X is expressed as
$$U = W X^T \tag{3.7}$$
The n component basis vectors that best distinguish humans from non-humans should be chosen from the many candidate components.
3.2.2 Entropy feature selection
Unlike PCA features, ICA features are not sorted, so the conditional entropy is applied to feature selection, that is, to sorting the ICA features and choosing an appropriate subset of them. Sorting variables can be an important step in enhancing a high-dimensional dataset, which gave us the idea of placing correlated or similar dimensions close to each other in the high-dimensional value space to help the human user.
If entropy is the amount of information provided by a random variable, then conditional entropy can be defined as the amount of information about one random variable provided by another random variable. The entropy of a random variable reflects how much information the observed variable carries: the more random and unpredictable the variable is, the larger its entropy value.
The 2-D data space obtained from ICA feature extraction needs to be discretized into a matrix of grid cells by separating each dimension into a set of intervals, or bins.
The discretization process begins by calculating the mean value of the data in one dimension and dividing the data into two halves at that mean value. Recursively, each half is divided into two halves at its own mean value. The recursion stops when the required number of intervals is obtained or the constraint on the total number of bins is met. Let Z be a discrete random variable with possible values $\{z_1, z_2, \ldots, z_m\}$. The information entropy of Z with probability distribution p(z) is defined as
$$H(Z) = -\sum_{i=1}^{m} p(z_i) \log p(z_i) \tag{3.8}$$
The conditional entropy quantifies the uncertainty of a random variable Y given that the value of a second random variable Z is known. Each coefficient has to be normalized to [-1, 1] and quantized into n bins. Let Y = {-1, 1} be the desired class; then the conditional entropy can be described as
$$H(Y \mid Z) = H(Y, Z) - H(Z) \tag{3.9}$$
The conditional entropy H(Y | Z) is a weighted sum of the entropy values in all columns, where the joint entropy is defined by
$$H(Y, Z) = -\sum_{y} \sum_{z} p(y, z) \log p(y, z) \tag{3.10}$$
We sort the conditional entropy H(Y | Z) and use the sorted results to select the corresponding independent components. The coefficients or independent components with better classification ability are associated with smaller conditional entropy.
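The discretization and ranking steps can be illustrated with the following sketch. It is not the code used in this work: the function names, the recursion depth as the stopping rule, and the base-2 logarithm are assumptions, and the conditional entropy is computed in its weighted-sum form, H(Y | Z) = Σ p(z) H(Y | Z = z).

```python
import bisect
import math
from collections import Counter


def mean_split_edges(values, depth):
    """Recursively split values at their mean; returns sorted interior bin edges."""
    if depth == 0 or len(values) < 2:
        return []
    mean = sum(values) / len(values)
    low = [v for v in values if v <= mean]
    high = [v for v in values if v > mean]
    if not low or not high:  # all values equal: nothing left to split
        return []
    return mean_split_edges(low, depth - 1) + [mean] + mean_split_edges(high, depth - 1)


def assign_bins(values, edges):
    """Bin index of each value; a value equal to an edge falls in the lower bin."""
    return [bisect.bisect_left(edges, v) for v in values]


def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def conditional_entropy(labels, bins):
    """H(Y | Z) as the p(z)-weighted sum of the class entropy inside each bin."""
    by_bin = {}
    for y, z in zip(labels, bins):
        by_bin.setdefault(z, []).append(y)
    return sum(len(ys) / len(labels) * entropy(ys) for ys in by_bin.values())


def rank_features(labels, binned_features):
    """Feature indices sorted by ascending H(Y | Z), most discriminative first."""
    scores = [conditional_entropy(labels, bins) for bins in binned_features]
    return sorted(range(len(scores)), key=lambda i: scores[i])


labels = [1, 1, -1, -1]
informative = [0, 0, 1, 1]    # the bin index determines the class exactly
uninformative = [0, 1, 0, 1]  # the bin index says nothing about the class
order = rank_features(labels, [uninformative, informative])
```

The feature whose bins separate the two classes has zero conditional entropy and is ranked first, which matches the selection rule stated above.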
The selected ICA features are used in the SVM classifier to distinguish humans from non-humans. The training database consists of 1843 human and 840 non-human images, each of size (40x40) pixels. The maximum number of ICs is 76, and we choose the 30 best bases for the SVM classifier.
3.2.3 SVM classifier
Support Vector Machines (SVMs) were developed to solve classification and regression problems. SVM starts from a linearly separable problem. For classification, the goal of SVM is to separate two classes by a function induced from the available examples. Consider the example in Fig. 3-6: there are two classes of data and many possible linear classifiers that can separate them, but only one maximizes the margin, that is, the distance between the two classes. This linear classifier is called the optimal separating hyperplane. Fig. 3-7 shows the accuracy rate and the number of support vectors (SVs) for every number of ICs up to 76. Based on this experimental result, the accuracy rate of the human detection system increases as the number of ICs increases. In practice, however, we select 30 ICs because the number of SVs is then close to its minimum while the accuracy rate is above 90%. The main reason for minimizing the number of SVs is to decrease the computation cost in order to meet the real-time requirement.
Fig. 3-6 Optimal separating hyperplane
Fig. 3-7 Analysis of feature selection. (a) Accuracy rate. (b) Number of SV.
Chapter 4
Fig. 4-1 Human tracking flow chart
A human tracking system based on the mean-shift algorithm is proposed in this thesis. The mean-shift algorithm is a simple iterative procedure that shifts each data point to the average of the data points in its neighborhood, thereby locating the maxima of a density function given discrete data sampled from that function [36]. Generally, the mean-shift algorithm uses a color feature, but in this thesis we combine the color feature with the IC (independent component) feature. The idea behind the combination comes from the IC-based human detection in Chapter 3. Although the color feature alone can track a moving object, the combination of color and ICs can track not just any moving object but specifically a moving human.
The Kalman filter is applied to predict the location of the moving human in the next frame. The Kalman filter is activated when the moving human is partially occluded by another object, in other words, when the mean-shift similarity value decreases until it is smaller than a threshold.
4.1 Features
4.1.1 Color feature
Most kernel-based object tracking methods use color as the feature. Color information extends the original gray-scale image into three dimensions, which improves tracking performance. Most papers that use mean-shift as the tracking algorithm rely on color as the feature to lock onto the object. We compared the performance of three color spaces: RGB, HSV, and Y'UV. In our experiments the HSV color space gave good tracking performance, so we apply it to tracking with the active camera.
HSV color transformation:
The main purpose of the HSV color space transformation is to reduce the sensitivity to illumination or lightness information in the RGB color space. Fig. 4-2 (a) and (b) show the RGB and HSV color spaces, respectively.
The HSV model, also known as HSB model, was created in 1978 by Alvy Ray Smith. It is a nonlinear transformation of the RGB color space. It defines a color space in terms of three components: hue, saturation, and value. The definition is described below: [16]
1. Hue: the color type, ranging from 0 to 360 degrees. Each value corresponds to one color; for example, 0 is red, 30 is orange, and 60 is yellow. 360 degrees is equal to 0 degrees.
2. Saturation: the intensity of the color, ranging from 0% to 100%. 0 means no color, that is, only gray values between black and white. 100 means the most intense, fully saturated color.
3. Value: the brightness of the color, also ranging from 0% to 100%. 0 is always black. Depending on the saturation, 100 may be white or a more or less saturated color.
Fig. 4-2 (a) RGB color model [17] (b) HSV color model [18]
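The transformation from the RGB color model to the HSV color model can be described by the following equations.
$$H = \begin{cases} 0, & MAX = MIN \\ \left(60 \times \dfrac{G - B}{MAX - MIN}\right) \bmod 360, & MAX = R \\ 60 \times \dfrac{B - R}{MAX - MIN} + 120, & MAX = G \\ 60 \times \dfrac{R - G}{MAX - MIN} + 240, & MAX = B \end{cases}$$
$$S = \begin{cases} 0, & MAX = 0 \\ \dfrac{MAX - MIN}{MAX}, & \text{otherwise} \end{cases}$$
$$V = MAX$$
where MAX and MIN are the maximum and minimum values of (R, G, B), respectively.
Y'UV color transformation
The Y'UV color model defines a color space in terms of one luma (Y') and two chrominance (UV) components. The Y'UV color model is used in the NTSC, PAL, and SECAM composite color video standards. Y' stands for the brightness component, and U and V are the chrominance components. The transform equations are as follows:
$$Y' = 0.299R + 0.587G + 0.114B$$
$$U = 0.492\,(B - Y')$$
$$V = 0.877\,(R - Y')$$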
Each channel of the color space has 8 bits of data, giving (256x256x256) possible colors. We quantize the color space into (16x16x16) levels, so the histogram of the color feature consists of 4096 bins.
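The color conversion and the 16x16x16 quantization can be sketched as follows. The function names are hypothetical, the hue convention follows the common hexcone formulation, and `color_bin` assumes that all three channels have already been scaled to 8-bit integers; how the H, S, and V channels are scaled to 8 bits in this work is not stated.

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to (H in degrees, S in [0, 1], V in [0, 1])."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    mx, mn = max(r, g, b), min(r, g, b)
    delta = mx - mn
    if delta == 0:
        h = 0.0  # hue is undefined for grays; 0 by convention
    elif mx == r:
        h = (60.0 * (g - b) / delta) % 360.0
    elif mx == g:
        h = 60.0 * (b - r) / delta + 120.0
    else:
        h = 60.0 * (r - g) / delta + 240.0
    s = 0.0 if mx == 0 else delta / mx
    return h, s, mx


def color_bin(c1, c2, c3, levels=16):
    """Map three 8-bit channels to one of levels**3 histogram bins."""
    step = 256 // levels  # 16 consecutive channel values share one level
    return (c1 // step) * levels * levels + (c2 // step) * levels + (c3 // step)


red = rgb_to_hsv(255, 0, 0)
yellow = rgb_to_hsv(255, 255, 0)
first_bin = color_bin(0, 0, 0)
last_bin = color_bin(255, 255, 255)
```

With 16 levels per channel, the bin index runs from 0 to 4095, which gives the 4096-bin color histogram.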
4.1.2 ICA feature
In Chapter 3, the human detection algorithm uses entropy feature selection to select 30 ICs from the 76 ICs. These 30 ICs are processed by the SVM classifier to decide whether the foreground object is human or non-human. If the SVM classifier classifies the foreground object as human, these ICs are then used in the mean-shift tracking algorithm.
In practice, the combination of color and ICs constructs a histogram with a total of 4126 bins. The combination procedure is as follows:
1. Select the ROI (region of interest), in this case the human's ROI.
2. Compute the color histogram with the kernel function, as shown in Fig. 4-3 (b).
3. Extract the ICs.
4. Combine the histograms of the color and IC features, as shown in Fig. 4-4.
Fig. 4-3 (a) Human's ROI (b) Kernel function (c) Histogram of color feature
Fig. 4-4 Color and ICA feature histogram
4.2 Kernel functions
The feature-histogram-based target representations are regularized by spatial masking with an isotropic kernel. The masking induces spatially smooth similarity functions suitable for gradient-based optimization; hence, the target localization problem can be formulated using the basin of attraction of the local maxima [19].
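There are many kinds of kernel functions, such as the Gaussian kernel, the flat kernel, and the Epanechnikov kernel. Let x be the normalized pixel location in the region defined as the target model; the Gaussian, flat, and Epanechnikov kernels [36] are then defined, up to a normalization constant, as follows.
1. Gaussian kernel
$$K_G(x) = e^{-\|x\|^2}$$
2. Flat kernel
$$K_F(x) = \begin{cases} 1, & \|x\| \leq 1 \\ 0, & \text{otherwise} \end{cases}$$
3. Epanechnikov kernel
$$K_E(x) = \begin{cases} 1 - \|x\|^2, & \|x\| \leq 1 \\ 0, & \text{otherwise} \end{cases}$$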
Fig. 4-5 (a) and (c) show that the Gaussian and Epanechnikov kernels are similar: both have their highest value at the center of the distribution. In the ROI of the target model, pixels closer to the center of the ROI are more important.
The Gaussian and Epanechnikov kernels therefore suppress the boundary information, and their accuracy is higher than that of the flat kernel.
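To make the comparison concrete, the three kernels can be evaluated as functions of the normalized distance r = ||x|| from the ROI center. The sketch below uses unnormalized profiles; the function names and the exact form of the Gaussian exponent are assumptions.

```python
import math


def gaussian_kernel(r):
    """Unnormalized Gaussian profile, largest at the center."""
    return math.exp(-r * r)


def flat_kernel(r):
    """Constant weight inside the unit ball, zero outside."""
    return 1.0 if r <= 1.0 else 0.0


def epanechnikov_kernel(r):
    """Weight falls off quadratically and reaches zero at the ROI boundary."""
    return 1.0 - r * r if r <= 1.0 else 0.0


# Weights at the center, halfway out, and on the boundary of the ROI.
radii = [0.0, 0.5, 1.0]
flat_weights = [flat_kernel(r) for r in radii]
epanechnikov_weights = [epanechnikov_kernel(r) for r in radii]
gaussian_weights = [gaussian_kernel(r) for r in radii]
```

The flat kernel gives boundary pixels the same weight as the center, whereas the other two kernels weight them less, which is why boundary pixels (often background) influence the histogram less.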
Fig. 4-5 (a) Gaussian kernel (b) flat kernel (c) Epanechnikov kernel
Fig. 4-6 (a) Target object (b) Kernel function (c) Target object and Kernel function
4.3 Mean-shift algorithm
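In order to characterize the target, a feature space is chosen first. The reference target model is represented by its normalized histogram $\hat{q}$ in the feature space and can be considered as centered at spatial location 0. In the subsequent frame, the candidate model is defined at location y and is expressed as $\hat{p}(y)$. We use Eq. 4.8 and Eq. 4.9 as our target and candidate models, respectively.
$$\hat{q} = \{\hat{q}_u\}_{u=1 \ldots m}, \qquad \sum_{u=1}^{m} \hat{q}_u = 1 \tag{4.8}$$
$$\hat{p}(y) = \{\hat{p}_u(y)\}_{u=1 \ldots m}, \qquad \sum_{u=1}^{m} \hat{p}_u(y) = 1 \tag{4.9}$$
The similarity value between $\hat{p}(y)$ and $\hat{q}$ is
$$\hat{\rho}(y) \equiv \rho[\hat{p}(y), \hat{q}]$$
Let $\{x_i^*\}_{i=1 \ldots n}$ be the normalized pixel locations in the region defined as the target model. The region is centered at 0. Here we use the Epanechnikov kernel with profile k(x); using these weights increases the robustness of the density estimation, since the peripheral pixels are the least reliable.
The function $b: R^2 \to \{1 \ldots m\}$ associates to the pixel at location $x_i^*$ the index $b(x_i^*)$ of its bin in the quantized feature space. The probability of the feature u = 1…m in the target model is then computed as
$$\hat{q}_u = C \sum_{i=1}^{n} k(\|x_i^*\|^2)\, \delta[b(x_i^*) - u]$$
where δ is the Kronecker delta function. The normalization constant C is derived by imposing the condition $\sum_{u=1}^{m} \hat{q}_u = 1$, from where
$$C = \frac{1}{\sum_{i=1}^{n} k(\|x_i^*\|^2)}$$
since the summation of delta functions for u = 1…m is equal to one.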
Candidate model
Let $\{x_i\}_{i=1 \ldots n_h}$ be the normalized pixel locations of the candidate model, centered at y in the current frame. The normalization is inherited from the frame containing the target model. Here we use the same Epanechnikov kernel as in the target model, but with bandwidth h; the probability of the feature u = 1…m in the candidate model is given by
$$\hat{p}_u(y) = C_h \sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \delta[b(x_i) - u]$$
where
$$C_h = \frac{1}{\sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right)}$$
is the normalization constant. Note that $C_h$ does not depend on y, because the pixel locations $x_i$ are organized in a regular lattice and y is one of the lattice nodes. The bandwidth h defines the scale of the target candidate.
Similarity measure
The similarity function defines a distance between the target model and the candidate model. The Bhattacharyya coefficient, which evaluates the similarity of the target model and the candidate model, is defined as
$$\hat{\rho}(y) \equiv \rho[\hat{p}(y), \hat{q}] = \sum_{u=1}^{m} \sqrt{\hat{p}_u(y)\, \hat{q}_u} \tag{4.16}$$
To find the location corresponding to the target in the current frame, the Bhattacharyya coefficient in Eq. 4.16 should be maximized as a function of y, which can be done by running the mean-shift iterations.
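The kernel-weighted histogram and the Bhattacharyya coefficient can be sketched as below. This is an illustration only: pixels are given as hypothetical (x, y, bin index) triples, the Epanechnikov profile is used unnormalized, and the function names are not taken from this work.

```python
import math


def weighted_histogram(pixels, center, bandwidth, n_bins):
    """Normalized histogram in which each pixel votes with its kernel weight."""
    hist = [0.0] * n_bins
    for x, y, u in pixels:
        r2 = ((x - center[0]) ** 2 + (y - center[1]) ** 2) / bandwidth ** 2
        if r2 <= 1.0:
            hist[u] += 1.0 - r2  # Epanechnikov profile k(r^2) = 1 - r^2
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms."""
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))


# Two pixels: one at the center (bin 0) and one halfway to the boundary (bin 1).
pixels = [(0, 0, 0), (1, 0, 1)]
candidate = weighted_histogram(pixels, center=(0, 0), bandwidth=2, n_bins=2)
identical = bhattacharyya([0.5, 0.5], [0.5, 0.5])
disjoint = bhattacharyya([1.0, 0.0], [0.0, 1.0])
```

The coefficient is 1 for identical histograms and 0 for histograms with no common bins, so a drop in its value signals a poor match between candidate and target.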
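Object localization
Color and IC information were chosen as the features; however, the same framework can be used for texture and edges, or any combination of them. In the sequel, it is assumed that the following information is available: (a) detection and localization of the objects to track in the initial frame; (b) periodic analysis of every object to account for possible updates of the target models due to significant changes in color.
Minimizing the distance $d(y) = \sqrt{1 - \rho[\hat{p}(y), \hat{q}]}$ is equivalent to maximizing the Bhattacharyya coefficient $\hat{\rho}(y)$. The search for the new target location in the current frame starts at the location $\hat{y}_0$ of the target in the previous frame. So the probabilities $\{\hat{p}_u(\hat{y}_0)\}_{u=1 \ldots m}$ of the candidate at location $\hat{y}_0$ in the current frame have to be computed first. Using a Taylor expansion around the values $\hat{p}_u(\hat{y}_0)$, the Bhattacharyya coefficient is approximated as
$$\rho[\hat{p}(y), \hat{q}] \approx \frac{1}{2} \sum_{u=1}^{m} \sqrt{\hat{p}_u(\hat{y}_0)\, \hat{q}_u} + \frac{C_h}{2} \sum_{i=1}^{n_h} w_i\, k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \tag{4.17}$$
where
$$w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(\hat{y}_0)}}\; \delta[b(x_i) - u] \tag{4.18}$$
In order to minimize the distance d(y), the second term in Eq. 4.17 has to be maximized, the first term being independent of y. The second term represents the density estimate computed with the kernel function at y in the current frame, with the data being weighted by Eq. 4.18. In this process, the kernel is recursively moved from the current location $\hat{y}_0$ to the new location $\hat{y}_1$ according to the relation
$$\hat{y}_1 = \frac{\sum_{i=1}^{n_h} x_i\, w_i\, g\!\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n_h} w_i\, g\!\left(\left\|\frac{\hat{y}_0 - x_i}{h}\right\|^2\right)} \tag{4.19}$$
where g(x) = -k'(x).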
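Fig. 4-7 Mean-shift algorithm flow chart
Given the target model $\{\hat{q}_u\}_{u=1 \ldots m}$ and its location $\hat{y}_0$ in the previous frame, set the initial previous similarity value to 0. The mean-shift algorithm then proceeds as follows:
1. Initialize the location of the target in the current frame with $\hat{y}_0$, compute $\{\hat{p}_u(\hat{y}_0)\}_{u=1 \ldots m}$, and evaluate $\rho[\hat{p}(\hat{y}_0), \hat{q}]$.
2. Derive the weights $\{w_i\}$ according to Eq. 4.18.
3. Find the next location $\hat{y}_1$ of the candidate according to Eq. 4.19.
4. Compute $\{\hat{p}_u(\hat{y}_1)\}_{u=1 \ldots m}$ and evaluate $\rho[\hat{p}(\hat{y}_1), \hat{q}]$.
5. If $\rho[\hat{p}(\hat{y}_1), \hat{q}]$ is larger than the previous similarity value, update the previous similarity value, set $\hat{y}_0 \leftarrow \hat{y}_1$, and go to step 2. Otherwise, set $\hat{y}_0 \leftarrow \hat{y}_1$ and break.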
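The iteration can be sketched as follows for a toy frame. This is an illustrative reading of the steps, not the implementation used in this work. Pixels are hypothetical (x, y, bin index) triples. With the Epanechnikov kernel the derivative profile g is constant inside the window, so Eq. 4.19 reduces to a weighted mean of the pixel locations. The stopping rule, which halts as soon as the similarity no longer increases, is an assumption.

```python
import math


def histogram(pixels, center, h, n_bins):
    """Epanechnikov-weighted, normalized histogram of the window around center."""
    hist = [0.0] * n_bins
    for x, y, u in pixels:
        r2 = ((x - center[0]) ** 2 + (y - center[1]) ** 2) / (h * h)
        if r2 <= 1.0:
            hist[u] += 1.0 - r2
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist


def similarity(p, q):
    """Bhattacharyya coefficient."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def mean_shift(pixels, q, y0, h, max_iter=20):
    """Move y0 toward the mode of the similarity surface; returns the final location."""
    prev_sim = 0.0
    for _ in range(max_iter):
        p = histogram(pixels, y0, h, len(q))
        sx = sy = sw = 0.0
        for x, y, u in pixels:
            inside = (x - y0[0]) ** 2 + (y - y0[1]) ** 2 <= h * h
            if inside and p[u] > 0:
                w = math.sqrt(q[u] / p[u])  # weight of this pixel's bin
                sx, sy, sw = sx + w * x, sy + w * y, sw + w
        if sw == 0:
            break
        y1 = (sx / sw, sy / sw)  # weighted mean of the window (constant g)
        sim = similarity(histogram(pixels, y1, h, len(q)), q)
        y0 = y1
        if sim <= prev_sim:  # assumed stopping rule: no further improvement
            break
        prev_sim = sim
    return y0


# Toy frame: a 5x5 target blob (bin 1) centered at (10, 10) on a bin-0 background.
pixels = [
    (x, y, 1 if abs(x - 10) <= 2 and abs(y - 10) <= 2 else 0)
    for x in range(21) for y in range(21)
]
target_model = [0.0, 1.0]  # the target consists of bin-1 pixels only
location = mean_shift(pixels, target_model, (8.0, 8.0), h=5.0)
```

Starting two pixels off in each direction, the window moves onto the center of the blob within a few iterations.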
4.4 ROI resizing
In real situations, a human may walk toward or away from the camera, so a fixed ROI scale is not suitable: the ROI will contain some background pixels or only some parts of the human. Consequently, the tracking result is affected, as shown in Fig. 4-8, and the similarity value falls below the threshold value. The ability to resize the ROI is therefore important. In our system the ROI scale is adjusted every 100 frames.
Fig. 4-8 ROI scale larger than object
Fig. 4-9 ROI resize flow chart
In order to adjust the ROI scale adaptively, we first use temporal differencing to find the foreground image and its position; the temporal difference is computed only within the ROI region. The dilation process is then applied to gradually enlarge the boundaries of the foreground pixels and to link the broken boundary parts produced by temporal differencing.
The projection of the dilated image onto the x-axis and y-axis produces the current width (width_current) and height (height_current), as shown in Fig. 4-10. Finally, the new ROI size is determined with Eq. 4.20. Sometimes the dilation process is unable to link the broken boundaries, so the projection onto the x-axis and y-axis produces an ROI smaller than the actual human. We therefore set a minimum size for the ROI. After the ROI scale is determined, the target model is updated as well. The update is performed on the histogram as shown in Eq. 4.21, where α is set to 0.6.
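The projection step and the histogram update can be sketched as follows. Both functions are hypothetical illustrations. The size is taken directly as the extent of the non-empty projections, clamped to a minimum. The blend assumes that α weights the old target model, which is one plausible reading of the update rule, not a confirmed one.

```python
def projected_size(mask, min_width, min_height):
    """Width and height of the foreground from its x-axis and y-axis projections."""
    column_sums = [sum(col) for col in zip(*mask)]  # projection onto the x-axis
    row_sums = [sum(row) for row in mask]           # projection onto the y-axis
    xs = [i for i, v in enumerate(column_sums) if v > 0]
    ys = [i for i, v in enumerate(row_sums) if v > 0]
    width = xs[-1] - xs[0] + 1 if xs else 0
    height = ys[-1] - ys[0] + 1 if ys else 0
    # Clamp to the minimum ROI size so broken boundaries cannot shrink the ROI.
    return max(width, min_width), max(height, min_height)


def blend_histograms(old, new, alpha=0.6):
    """Assumed update: alpha weights the old model, (1 - alpha) the new one."""
    return [alpha * o + (1.0 - alpha) * n for o, n in zip(old, new)]


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
size = projected_size(mask, min_width=2, min_height=2)
clamped = projected_size(mask, min_width=4, min_height=4)
updated = blend_histograms([1.0, 0.0], [0.0, 1.0])
```

A blended histogram remains normalized whenever both inputs are normalized, so the updated target model can be used directly in the similarity measure.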
Fig. 4-10 X-axis and Y-axis image projection
Fig. 4-11 and Fig. 4-12 show the ROI resizing with a fixed and an active camera, respectively.
Fig. 4-11 Fixed camera condition. (a) and (b): left is the difference image; right is the image after ROI resizing.
Fig. 4-12 Active camera condition. (a) Original image. (c-e) Left: difference images; right: images after ROI resizing.
4.5 Kalman filter
Fig. 4-13 Kalman filter flow chart
Fig. 4-13 shows the Kalman filter algorithm, which is used to predict the target location when the target is occluded by another object or if the Bhattacharyya similarity