Chapter 1 Introduction
1.2 System overview
For the hardware architecture, the system shown in Fig. 1.1 is built by mounting two cameras on a horizontal line with their lines of sight parallel and fixed. The distance between the two cameras is fixed at 10 cm. Both cameras are QuickCam™ Communicate Deluxe units with the specifications listed below. The experiments were conducted in our laboratory, where the maximum depth of the background is 180 cm.
1.3‐megapixel sensor with RightLight™2 Technology
Built‐in microphone with RightSound™ Technology
Video capture: Up to 1280 x 1024 pixels (HD quality) (HD Video 960 x 720 pixels)
Frame rate: Up to 30 frames per second
Still image capture: 5 megapixels (with software enhancement)
USB 2.0 certified
Optics: Manual focus
Fig. 1.1 The humanoid vision system.
For the software architecture, Fig. 1.2 shows the flow chart of the proposed system. Moving objects are separated from the scene by background subtraction.
Background subtraction typically uses many images to generate the background. After the background is generated, the foreground is obtained by subtracting the background from the current image.
Because shadows in the scene degrade detection performance, shadow removal is applied to remove them [6] [15]. After the shadows are removed, a morphology operation is applied to remove noise. The morphology operation eliminates small regions and enlarges large regions. After the noise is removed, image projection is applied to locate the position of the object. Feature extraction is then applied to the extracted region to obtain the features, and the resulting data are fed into a neural network that classifies whether the object is a pedestrian.
Fig. 1.2 The software architecture
Chapter 2
Related Work
2.1 Foreground segmentation
For a pedestrian detection system, the first task is to segment the region of interest (ROI) in the scene. There are three common approaches: frame difference, background subtraction, and optical flow.
The frame difference method [10] performs pixel-based subtraction between successive frames, and a specific threshold is then used to separate foreground pixels from background pixels. This method adapts quickly to changes in illumination and camera motion and requires little computation. However, it can only detect the ROI, and it does not yield good results when the foreground regions of interest are not sufficiently textured.
Optical flow [9] reflects the image changes due to motion during a time interval, and the optical flow field is the velocity field that represents the three-dimensional motion of foreground points across a two-dimensional image. Compared with the other two methods, optical flow can detect the foreground regions of interest more accurately. However, optical flow computation is very intensive and difficult to realize in real time.
Background subtraction [8] is the most common method for segmenting regions of interest in videos. This method first has to build an initial background model. The purpose of training the background model is to subtract the background image from the current image in order to obtain the foreground regions of interest. Background subtraction can detect the most complete set of feature points of the foreground regions of interest and can be implemented in real time. However, this method cannot be used in the presence of camera motion, and the background model must be updated in a timely manner to account for illumination changes and the movement of background objects.
2.2 Shadow removal
Generally, shadows are classified into two categories: cast shadows and self-shadows. Cast shadows refer to areas in the background model projected by objects.
Li-Qun Xu, Jose Luis Landabaso and Montse Paradas [7] analyze foreground pixels and detect those that have a similar color but lower brightness than the corresponding background regions in RGB color space. A foreground pixel that satisfies this condition is regarded as a self-shadow pixel.
2.3 Morphology operation
Morphology has two basic operations: dilation and erosion.
Dilation is defined as:
$$A \oplus B = \{x : (\hat{B})_x \cap A \neq \emptyset\} \qquad (2.1)$$
where A and B are sets in Z. This equation simply means that B is moved over A, and the positions at which B, reflected and translated, intersects A are found. Usually A is the signal or image being operated on and B is the structuring element. Figure 2.1 shows how dilation works.
Fig. 2.1 Example of dilation
The opposite of dilation is known as erosion. This is defined as:
$$A \ominus B = \{x : (B)_x \subseteq A\} = \bigcap_{x \in B} A_{-x} \qquad (2.2)$$
The equation simply says that the erosion of A by B is the set of points x such that B translated by x is contained in A. Figure 2.2 shows how erosion works. It proceeds in exactly the same way as dilation. However, equation (2.2) essentially says that for the output to be a one, all of the inputs must be the same as the structuring element.
Thus, erosion will remove runs of ones that are shorter than the structuring element.
This thesis applies both of these operations to process the image.
Fig. 2.2 Example of erosion
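The two definitions can be illustrated with a minimal sketch on one-dimensional binary signals, where each signal is represented as the set of positions that hold a one. The function names and the example sets are assumptions made for illustration only.

```python
def dilate(a, b):
    """Dilation of set a by structuring element b (sets of integers).

    Equivalent to collecting every sum x + y with x in a and y in b.
    """
    return {x + y for x in a for y in b}


def erode(a, b):
    """Erosion: all x such that b translated by x is contained in a."""
    if not a or not b:
        return set()
    # Only translations that map some element of b into a can qualify.
    candidates = {x - y for x in a for y in b}
    return {x for x in candidates if all(x + y in a for y in b)}


# A run of three ones at positions 2..4 and an isolated one at position 8.
signal = {2, 3, 4, 8}
# Structuring element of length two.
element = {0, 1}

eroded = erode(signal, element)    # the isolated one disappears
dilated = dilate(signal, element)  # every run grows by one position
```

In this example the isolated one at position 8 is shorter than the structuring element, so erosion removes it, while the run at positions 2 to 4 shrinks to positions 2 and 3.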
2.4 Neural network
2.4.1 Introduction to ANNs
The human nervous system consists of a large number of neurons, each including a soma, an axon, dendrites and synapses. Each neuron is capable of receiving, processing, and passing electrochemical signals from one to another. To mimic the characteristics of the human nervous system, investigators have developed an intelligent algorithm, called artificial neural networks (ANNs), to construct intelligent machines capable of parallel computation. This thesis will apply ANNs to the depth detection in an eyeball system through learning.
Fig. 2.3 Basic element of ANNs
An ANN can be divided into three kinds of layers: the input layer, the hidden layer, and the output layer. The input layer receives signals from the outside world and contains only input values, without neurons. The number of neurons in the output layer depends on the number of outputs, and the response of the net is read from the output layer. The neurons between the input layer and the output layer belong to the hidden layer. The basic element shown in Fig. 2.3 obeys the input-output relation
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (2.3)$$
where $w_i$ is the weight at the input $x_i$ and $b$ is a bias term. The activation function $f(\cdot)$ can be of many types, both linear and nonlinear. A commonly used activation function is
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2.4)$$
which is a sigmoid function. Based on the basic element, the most common multilayer feed-forward net is shown in Fig. 2.4; it contains an input layer, an output layer, and two hidden layers. Multilayer nets can solve more complicated problems than single-layer nets; that is, a multilayer net may solve some cases that a single-layer net cannot be trained to perform correctly at all. However, the training process of multilayer nets may be more difficult. The number of hidden layers and the number of neurons in each are decided by the complexity of the problem to be solved.
Fig. 2.4 Multilayer feed‐forward network
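The basic element described above, a weighted sum of the inputs plus a bias passed through the sigmoid activation function, can be sketched in a few lines. The function names and the example weights are assumptions chosen only to illustrate the computation.

```python
import math


def sigmoid(x):
    """Sigmoid activation function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(weights, inputs, bias, activation=sigmoid):
    """Output of one neuron: activation of the weighted input sum plus bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Two inputs whose weighted contributions cancel, so the sigmoid sees zero.
y = neuron_output(weights=[0.5, -0.5], inputs=[1.0, 1.0], bias=0.0)
```

Because the weighted sum is zero in this example, the neuron output is the midpoint of the sigmoid range.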
important matter of different neural net. For convenience, the training for a neural
2.4.2 Back-Propagation Network
In supervised learning, the back-propagation learning algorithm is widely used in most applications. The back-propagation (BP) algorithm was proposed in 1986 by Rumelhart, Hinton and Williams; it is based on the steepest gradient descent method for updating the weights to minimize the total square error of the output. Training by BP is mainly applied to multilayer feed-forward networks and involves three stages: the feed-forward of the input training pattern, the calculation and back-propagation of the associated error, and the adjustment of the weights. Fig. 2.5 shows a back-propagation network containing an input layer with $N_{inp}$ neurons, one hidden layer with $N_{hid}$ neurons, and an output layer with $N_{out}$ neurons. In Fig. 2.5, $x = [x_1\ x_2\ \cdots\ x_{N_{inp}}]^T$, $h = [h_1\ h_2\ \cdots\ h_{N_{hid}}]^T$ and $y = [y_1\ y_2\ \cdots\ y_{N_{out}}]^T$ respectively represent the input, hidden, and output nodes of the network. In addition, $v_{ij}$ is the weight from the i-th neuron in the input layer to the j-th neuron in the hidden layer, and $w_{gh}$ is the weight from the g-th neuron in the hidden layer to the h-th neuron in the output layer.
Fig. 2.5 Back‐propagation network
The learning algorithm of BP is elaborated below:
Step 1: Input the training data, consisting of the input $x = [x_1\ x_2\ \cdots\ x_{N_{inp}}]^T$ and the desired output $t = [t_1\ t_2\ \cdots\ t_{N_{out}}]^T$. Set the maximum tolerable error $E_{max}$ and the learning rate $\eta$, which lies between 0.1 and 1.0, to reduce the computing time or increase the precision.
Step 2: Set the initial weights and bias values of the network at random.
Step 3: Calculate the output of each neuron in the hidden layer, where $f_h(\cdot)$ is the activation function of the neuron, and the output of the i-th neuron in the output layer, where $f_y(\cdot)$ is the activation function of the neuron.
Step 4: Calculate the error function between network output and desired output.
Step 8: Check whether the network has converged. If $E < E_{max}$, terminate the training process; otherwise, begin another learning cycle by going to Step 1.
The BP learning algorithm can be used to model various complicated nonlinear functions. In recent years, it has been successfully applied in many domains, such as pattern recognition, adaptive control, and clustering. In this thesis, the BP algorithm is used to learn the input-output relationship for the clustering problem.
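The three stages of BP training (feed-forward, back-propagation of the error, and weight adjustment) can be sketched as follows. This is a minimal illustration of the standard algorithm for one hidden layer, not the exact procedure of this thesis: the sigmoid activation in both layers, the per-pattern weight update, the initial weight range, the XOR training set and all function names are assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, v, bv, w, bw):
    """Feed-forward pass: returns hidden outputs h and network outputs y."""
    h = [sigmoid(sum(v[i][j] * x[i] for i in range(len(x))) + bv[j])
         for j in range(len(bv))]
    y = [sigmoid(sum(w[g][k] * h[g] for g in range(len(h))) + bw[k])
         for k in range(len(bw))]
    return h, y


def total_error(data, v, bv, w, bw):
    """Total square error E = 1/2 * sum (t - y)^2 over all patterns."""
    err = 0.0
    for x, t in data:
        _, y = forward(x, v, bv, w, bw)
        err += 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, y))
    return err


def train_bp(data, n_hid, eta=0.5, e_max=0.01, max_epochs=5000, seed=0):
    """Train a one-hidden-layer net; returns (initial error, final error)."""
    rng = random.Random(seed)
    n_inp, n_out = len(data[0][0]), len(data[0][1])
    # Random initial weights and biases (assumed range -0.5..0.5).
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_inp)]
    bv = [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hid)]
    bw = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    e_initial = e = total_error(data, v, bv, w, bw)
    for _ in range(max_epochs):
        for x, t in data:
            h, y = forward(x, v, bv, w, bw)
            # Error terms; y * (1 - y) is the derivative of the sigmoid.
            d_out = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(n_out)]
            d_hid = [h[g] * (1.0 - h[g])
                     * sum(w[g][k] * d_out[k] for k in range(n_out))
                     for g in range(n_hid)]
            # Steepest-descent weight adjustment with learning rate eta.
            for g in range(n_hid):
                for k in range(n_out):
                    w[g][k] += eta * d_out[k] * h[g]
            for k in range(n_out):
                bw[k] += eta * d_out[k]
            for i in range(n_inp):
                for j in range(n_hid):
                    v[i][j] += eta * d_hid[j] * x[i]
            for j in range(n_hid):
                bv[j] += eta * d_hid[j]
        e = total_error(data, v, bv, w, bw)
        if e < e_max:  # convergence check against the tolerable error
            break
    return e_initial, e


# Hypothetical training set: the XOR function, a simple nonlinear mapping.
xor_data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
e_initial, e_final = train_bp(xor_data, n_hid=3)
```

The sketch stops either when the total square error falls below the tolerable error or when the epoch limit is reached, so the final error should be read as the result of a bounded training run rather than a guaranteed optimum.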
Chapter 3
Pre‐processing and detect algorithm of pedestrian detection
Before objects are classified as pedestrian or non-pedestrian, two pre-processing steps are required for the classification system to work well. First, the moving objects are detected, and then each object's location in the scene is found.
In general, three techniques are adopted to determine the locations of moving objects: background subtraction [8], optical flow [9], and frame difference [10]. This thesis employs background subtraction since it consumes less computation time and improves the efficiency of classification.
3.1 Foreground segmentation
Foreground segmentation is one of the most important techniques in the system, because it increases the processing efficiency and its performance determines the quality of classification. Conventional foreground segmentation algorithms are roughly classified into two categories. The first category [11] uses spatial homogeneity as a criterion, but consumes tremendous computation time. The second category [12] is based on frame change or background mosaics, and can quickly distinguish object regions from a static background.
Collins et al. [13] combine the temporal difference and the background to detect moving pixels. First, moving regions are detected from two consecutive frames with the temporal difference, and then the compact moving pixels are further extracted by subtracting the background in these regions. They also model each background pixel with a normal distribution.
The foreground segmentation algorithm adopted in this thesis was presented by Kim et al. [14], as shown in Figure 3.1, and belongs to the second category. It is known that background information is very sensitive to variations in illumination. To deal with this problem, in the initialization step two background masks, $I_{min}$ and $I_{max}$, are obtained from the first N frames and defined as
$$I_{min}(x,y) = \min_{1 \le k \le N} I_k(x,y) \qquad (3.1)$$
$$I_{max}(x,y) = \max_{1 \le k \le N} I_k(x,y) \qquad (3.2)$$
where $I_k(x,y)$ is the intensity of the kth frame at (x, y), k=1,2,…,N. Once the initialization is finished, the frame difference mask is calculated as
$$I_{fd,k}(x,y) = \left| I_k(x,y) - I_{k-1}(x,y) \right|, \quad k > N \qquad (3.3)$$
Let $I_{fg,k}(x,y)$ be the foreground mask corresponding to the kth frame $I_k(x,y)$. Based on (3.1), (3.2) and (3.3), if $I_k(x,y)$ satisfies the following conditions
$$I_k(x,y) < I_{min}(x,y) - Th_{bg} \qquad (3.4)$$
$$I_k(x,y) > I_{max}(x,y) + Th_{bg} \qquad (3.5)$$
$$I_{fd,k}(x,y) > Th_{fd} \qquad (3.6)$$
then $I_{fg,k}(x,y) = 1$, representing that the pixel at (x,y) belongs to the foreground; otherwise $I_{fg,k}(x,y) = 0$, representing that the pixel at (x,y) belongs to the background. Next, the shadow removal method is applied to remove the shadow in the mask $I_{fg,k}(x,y)$.
Fig. 3.1 Foreground segmentation algorithm
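The initialization and thresholding steps described above can be sketched as follows. The sketch makes two assumptions that go beyond the text: the frame difference is taken as an absolute value, and a pixel is marked as foreground when it lies outside the background range by either (3.4) or (3.5) and also satisfies (3.6). Frames are plain nested lists of intensities, and all names and threshold values are hypothetical.

```python
def init_background(frames):
    """Per-pixel minimum and maximum intensity over the first N frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    i_min = [[min(f[y][x] for f in frames) for x in range(cols)]
             for y in range(rows)]
    i_max = [[max(f[y][x] for f in frames) for x in range(cols)]
             for y in range(rows)]
    return i_min, i_max


def foreground_mask(curr, prev, i_min, i_max, th_bg, th_fd):
    """Binary foreground mask for the current frame.

    Assumed rule: the pixel is outside [i_min - th_bg, i_max + th_bg]
    AND its absolute frame difference exceeds th_fd.
    """
    mask = []
    for y, row in enumerate(curr):
        mask_row = []
        for x, p in enumerate(row):
            outside = p < i_min[y][x] - th_bg or p > i_max[y][x] + th_bg
            moving = abs(p - prev[y][x]) > th_fd
            mask_row.append(1 if outside and moving else 0)
        mask.append(mask_row)
    return mask


# Three training frames of a 1x3 image with small illumination changes.
training = [[[10, 10, 10]], [[12, 10, 11]], [[11, 10, 10]]]
i_min, i_max = init_background(training)

# In the next frame only the middle pixel changes strongly.
mask = foreground_mask(curr=[[11, 60, 13]], prev=training[-1],
                       i_min=i_min, i_max=i_max, th_bg=5, th_fd=5)
```

Only the middle pixel leaves the background range and changes between frames, so it is the only pixel marked as foreground.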
3.2 Shadow removal
In general, an object in a scene unavoidably casts a shadow caused by the light source. Since the shadow often leads to errors in classification and object detection, a pedestrian detection system is required to remove the shadow of the object to be detected by a so-called shadow removal method. This thesis refers to the method proposed by Cucchiara et al. [15], which uses the HSV color space, which is closer to the human perception of color, and discriminates shadows from foreground more accurately.
It is known that the difference in hue between shadow and background is small and that the shadow often has lower saturation. Based on the kth frame $I_k(x,y)$ and the background image $B(x,y)$, the algorithm proposed in [15] for shadow removal tests three conditions, where $\tau_S$, $\tau_H$, $\alpha$ and $\beta$ are the threshold values. The threshold $\alpha$ prevents dark pixels from being misclassified as shadow because of the similar hue and saturation information in dark pixels. The threshold $\beta$ prevents noise caused by slight changes in the background model from being classified as shadow. The shadow removal operation is applied only where the foreground mask $I_{fg,k}(x,y)$ is equal to 1, and the value $I_{fg,k}(x,y)$ is changed to 0 when $I_k(x,y)$ satisfies the three conditions. Next, the morphology operation is applied to reduce the noise in the foreground mask.
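A possible form of the shadow test is sketched below. The three inequalities are an assumption based on the commonly cited HSV formulation attributed to [15]: the ratio of the value components lies between α and β, the saturation difference is at most τS, and the hue difference is at most τH. The function names, the RGB input format and the threshold values are hypothetical.

```python
import colorsys


def is_shadow(pixel_rgb, background_rgb, alpha, beta, tau_s, tau_h):
    """Assumed HSV shadow test for one pixel (RGB components in 0..255)."""
    h_i, s_i, v_i = colorsys.rgb_to_hsv(*[c / 255.0 for c in pixel_rgb])
    h_b, s_b, v_b = colorsys.rgb_to_hsv(*[c / 255.0 for c in background_rgb])
    if v_b == 0:
        return False  # a black background pixel cannot be darkened further
    ratio = v_i / v_b
    hue_diff = abs(h_i - h_b)
    hue_diff = min(hue_diff, 1.0 - hue_diff)  # hue is circular
    return (alpha <= ratio <= beta
            and (s_i - s_b) <= tau_s
            and hue_diff <= tau_h)


def remove_shadows(mask, frame, background, **thresholds):
    """Clear foreground pixels that pass the shadow test."""
    return [[0 if m == 1 and is_shadow(p, b, **thresholds) else m
             for m, p, b in zip(mask_row, frame_row, bg_row)]
            for mask_row, frame_row, bg_row in zip(mask, frame, background)]


params = dict(alpha=0.4, beta=0.9, tau_s=0.1, tau_h=0.1)
background = [[(200, 180, 160), (200, 180, 160), (200, 180, 160)]]
# A darkened copy of the background, a differently colored object pixel,
# and a very dark pixel.
frame = [[(120, 108, 96), (30, 60, 200), (20, 18, 16)]]
cleaned = remove_shadows([[1, 1, 1]], frame, background, **params)
```

In this example only the first pixel, which keeps the hue and saturation of the background at 60% of its brightness, is treated as shadow and cleared from the mask.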
3.3 Morphology operation
After the shadow removal operation is applied, shadows are removed from the foreground mask, but some noise still remains. One conventional way to eliminate noise regions is to use morphological operations to filter out isolated regions smaller than 3x3. In this thesis, the isolated regions are eliminated by the morphological opening operation, defined as:
$$A \circ B = (A \ominus B) \oplus B \qquad (3.10)$$
which combines the erosion operation and the dilation operation, as shown in Figure 3.2. As can be seen, the zeros are opened up. Any ones that are shorter than the structuring element are removed, but the rest of the signal is left unchanged.
Fig. 3.2 Example of opening operation
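The opening operation can be sketched on a small binary image as erosion followed by dilation. The sketch assumes a square 3x3 structuring element and treats pixels outside the image as zeros; the function names and the test image are hypothetical.

```python
def erode(img, k=3):
    """Keep a pixel only if its whole k x k neighbourhood is ones."""
    r = k // 2
    rows, cols = len(img), len(img[0])
    return [[int(all(0 <= y + dy < rows and 0 <= x + dx < cols
                     and img[y + dy][x + dx] == 1
                     for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
             for x in range(cols)] for y in range(rows)]


def dilate(img, k=3):
    """Set a pixel if any pixel in its k x k neighbourhood is one."""
    r = k // 2
    rows, cols = len(img), len(img[0])
    return [[int(any(0 <= y + dy < rows and 0 <= x + dx < cols
                     and img[y + dy][x + dx] == 1
                     for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
             for x in range(cols)] for y in range(rows)]


def opening(img, k=3):
    """Opening: erosion followed by dilation with the same element."""
    return dilate(erode(img, k), k)


# A 3x3 object and one isolated noise pixel at row 2, column 5.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
opened = opening(noisy)
```

The isolated pixel cannot contain the structuring element, so erosion removes it, and the subsequent dilation restores the 3x3 object to its original extent.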
In order to smooth the boundaries of the foreground mask f(j,k) and eliminate holes therein to achieve a new foreground mask F(j,k), this thesis adopts a profile extraction technique which is processed by the following steps [16]:
Step‐1:
For the j‐th row, j=1,2,…,240, the index of the pixel (j,k) is chosen horizontally from k=1 to k=301 as below:
$$g_1(j,k) = \sum_{i=1}^{20} \left(1 - 0.05(i-1)\right) f(j,\, k+i-1) \qquad (3.11)$$
where f(j,k) is the foreground mask. With the use of threshold M, which is determined by trial and error, the resulting image is obtained as
Different to Step‐2, the index of the pixel (j,k) for the k‐th column, k=1,2,…,320, is chosen vertically but reversely from j=240 to j=20 as below:
With the reverse process to Step‐1, the index of the pixel (j,k) for the j‐th row, will be determined by horizontal histogram which is calculated as
where $H_i$ denotes the horizontal histogram of the i-th column and $F(i,j)$ denotes the final foreground mask. Then, digitize $H_i$ as
Further define the horizontal boundary index as
Clearly, $B^H_i = 1$ and $B^H_i = -1$ respectively represent the left and right boundaries of an object at the i-th column. The number of columns with $B^H_i = 1$ is the number of objects.
Step‐2:
According to the horizontal boundary index, the columns related to each object are obtained and processed separately. Suppose that the k-th object lies between the p-th column and the q-th column, i.e., $B^H_p = 1$ and $B^H_q = -1$; then the vertical histogram of the k-th object is calculated for each row j, $1 \le j \le 240$, from $F(i,j)$ over these columns. As with the horizontal boundary index, $B^V_j(k) = 1$ and $B^V_j(k) = -1$ respectively represent the upper and lower boundaries of the object at the j-th row. An example is shown in Figure 3.3. After the histogram projection, the object's boundary is found. Next, feature extraction is applied.
Fig. 3.3 Example of histogram projection
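The histogram projection can be sketched as follows: the foreground mask is projected onto the columns to find the left and right boundaries of each object, and the columns of each object are then projected onto the rows to find its upper and lower boundaries. Digitizing a histogram as "greater than a threshold" and all names used here are assumptions for illustration.

```python
def find_runs(binary):
    """Return (start, end) index pairs of consecutive true entries."""
    runs, start = [], None
    for i, b in enumerate(binary):
        if b and start is None:
            start = i                    # rising edge: left/upper boundary
        elif not b and start is not None:
            runs.append((start, i - 1))  # falling edge: right/lower boundary
            start = None
    if start is not None:
        runs.append((start, len(binary) - 1))
    return runs


def locate_objects(mask, threshold=0):
    """Bounding boxes (left, right, top, bottom) found by projection."""
    rows, cols = len(mask), len(mask[0])
    # Histogram over columns: number of foreground pixels in each column.
    col_hist = [sum(mask[j][i] for j in range(rows)) for i in range(cols)]
    boxes = []
    for p, q in find_runs([h > threshold for h in col_hist]):
        # Histogram over rows, restricted to the columns of this object.
        row_hist = [sum(mask[j][p:q + 1]) for j in range(rows)]
        for top, bottom in find_runs([h > threshold for h in row_hist]):
            boxes.append((p, q, top, bottom))
    return boxes


mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
boxes = locate_objects(mask)
```

Each run of non-empty columns is treated as one object, which matches the statement that the number of left boundaries equals the number of objects.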
3.5 Feature extraction
The inputs of neural network are the features generated by feature extraction, including histogram‐of‐oriented‐gradients (HOG), global averaging, Haar‐like features and image gradient.
The HOG feature was first used for pedestrian detection by Shashua et al. [17], who extract orientation histogram features from 13 fixed overlapping parts according to the different subregions and clustered training subsets. The method used in this thesis was proposed by Dalal and Triggs [18] and Zhu et al. [19], and calculates the HOG by the following steps:
Step‐1 :
Compute both the magnitude and orientation of the gradient, which are $\sqrt{F_x^2 + F_y^2}$ and $\tan^{-1}(F_y / F_x)$ respectively, where $F_x$ and $F_y$ are the respective gradients in the horizontal and vertical directions obtained by convolving the image with Sobel filters.
Step‐2:
The gradient orientation is evenly divided into 9 bins over 0° to 180°. The sign of the orientation is ignored; thus, orientations between 180° and 360° are treated the same as those between 0° and 180°. Then, the gradient orientation histogram $E(i,j)_k$ in each orientation bin k of block $B(i,j)$ is obtained by summing all the gradient magnitudes whose orientations belong to bin k in $B(i,j)$. The histograms are then normalized, where e is a small constant used in the normalization.
Step‐3:
After normalization, all the normalized histograms $NE(i,j)$ are further grouped into a single vector expressed as
An example of HOG is shown in Fig.3.4.
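The histogram computation for one block can be sketched as follows, given the horizontal and vertical gradients of the block. The 9 unsigned bins over 0° to 180° follow the text, while the L2 normalization with a small constant is an assumption in the style of Dalal and Triggs; the function name and the example gradients are hypothetical.

```python
import math


def hog_block(fx, fy, bins=9, e=1e-6):
    """Normalized orientation histogram of one block.

    fx, fy: nested lists with the horizontal and vertical gradients.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for row_x, row_y in zip(fx, fy):
        for gx, gy in zip(row_x, row_y):
            magnitude = math.hypot(gx, gy)
            # Ignore the sign of the orientation: fold angles into 0..180.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            k = min(int(angle // bin_width), bins - 1)
            hist[k] += magnitude  # vote weighted by gradient magnitude
    # Assumed normalization: L2 norm with a small constant e.
    norm = math.sqrt(sum(v * v for v in hist) + e * e)
    return [v / norm for v in hist]


# Gradients pointing at 45, 135, 225 (folded to 45) and 90 degrees.
fx = [[1, 1], [-1, 0]]
fy = [[1, -1], [-1, 2]]
histogram = hog_block(fx, fy)
```

Gradients at 45° and 225° fall into the same bin because the sign of the orientation is ignored, so that bin collects twice the magnitude of the 135° bin in this example.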
For the second feature, global averaging [21], a pedestrian image of size 36x18
where $p_i(k,l)$ is the pixel value at coordinates k and l in block i. After all the pattern average values $PatAv_i$ are obtained, they are used as the input to the neural network for both training and testing.
Fig. 3.4 (a) Original image. (b) Gradient orientation and magnitude of each pixel are illustrated by the arrows’ direction and length. (c) HOG computation in each cell
The third feature, the Haar-like feature, can effectively capture the different appearance details of objects and can be computed quickly with the help of integral images [20]. To generate the Haar-like features of an image, the image is separated into blocks, from which a vector of Haar-like features is obtained. For example, there are five kinds of masks shown in Fig. 3.5, each with a feasible size and aspect ratio. In this thesis, the second type of mask is used to generate the Haar-like feature $H^{Haar}$, whose ith component corresponds to the ith block $B_i$ and is obtained as
$$H^{Haar}(i) = S_{white}(B_i) - S_{black}(B_i) \qquad (3.33)$$
where $S_{white}(B_i)$ and $S_{black}(B_i)$ are respectively the intensity sums over the white and black regions of the mask. In order to generate more dimensions of the Haar-like feature, the author modifies the equation as
$$H^{Haar}(i) = S_{white}(p(k,l)) - S_{black}(p(k,l)) \qquad (3.34)$$
where p(k,l) is the pixel value at coordinates k and l.
Fig. 3.5 Haar‐like mask
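The difference between the white and black intensity sums can be computed efficiently with an integral image, as sketched below. The layout of the mask is an assumption: a two-rectangle mask whose left half is white and whose right half is black, which may differ from the second mask type of Fig. 3.5. All function names are hypothetical.

```python
def integral_image(img):
    """Integral image with one extra row and column of zeros."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of pixel values in a rectangle using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def haar_two_rect(ii, top, left, height, width):
    """White (left half) sum minus black (right half) sum of one block."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return white - black


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
ii = integral_image(image)
feature = haar_two_rect(ii, top=0, left=0, height=2, width=4)
```

Once the integral image is built, each rectangle sum costs four lookups regardless of the block size, which is what makes Haar-like features fast to evaluate.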
The fourth feature, the gradient feature, is a vector formed from the gradient image with pixels $(F_x, F_y)$. An example of a gradient image is shown in Fig. 3.6.
Fig. 3.6 The gradient image for Fx and Fy
After the four features are generated, they are used as the training patterns of a neural network to determine whether the image is pedestrian or not.
3.6 Detect Algorithm
In general, the traditional approach uses only one feature of the target object for pedestrian recognition. Unfortunately, pedestrian-related images are usually too complicated to be recognized by one feature alone. Intuitively, using more features should lead to higher recognition performance, but using multiple features as the training pattern for a single neural network increases the difficulty of its learning.
This thesis proposes a novel neural network structure, which includes four primary neural networks and one secondary neural network as shown in Fig. 3.10, to deal with the multi-feature problem.
There are two stages of recognition, one for the primary neural networks and one for the secondary neural network. In the first stage, there are four primary neural networks, named HOG-neural network (HOGNN), Gradient-neural network (GRDNN), Haar-neural network (HARNN) and Global averaging neural network (GAVNN), each related to one feature and trained by back-propagation.
HOGNN includes one input layer with 1152 nodes, two hidden layers with 100 and 30 nodes, and one output layer with 1 node. The HOG feature vector in (3.x) has 1152 components, represented by HOG(p), p=1,2,…,1152, which are sent into the corresponding 1152 nodes of the input layer. The pth input node is connected to the qth node of the first hidden layer with weight W1hog(p,q). Hence, there exists a weight array W1hog(p,q) of dimension 100x1152, p=1,2,…,1152 and q=1,2,…,100. In addition, an extra bias b1hog(q), q=1,2,…,100, is added to the qth node of the first hidden layer. Similarly, the qth node of the first hidden layer is connected to the rth node of the second hidden layer with weight W2hog(q,r), which results in a weight array W2hog(q,r) of dimension 30x100, q=1,2,…,100 and r=1,2,…,30. An extra bias b2hog(r), r=1,2,…,30, is added to the rth node of the second hidden layer. Finally, the rth node of the second hidden layer is connected to the output node with weight W3hog(r), r=1,2,…,30, and a bias b3hog is added to the output node.
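The layer sizes described above can be made concrete with a sketch of the HOGNN forward pass. The weight arrays follow the stated dimensions (100x1152, 30x100 and 30 output weights). Using the hyperbolic tangent in the second hidden layer and the output node, the random weight range and all function names are assumptions, since only the first hidden layer's activation is stated in the text.

```python
import math
import random


def layer(inputs, weights, biases, activation=math.tanh):
    """One fully connected layer: weights[q][p] links input p to node q."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def hognn_forward(hog, w1, b1, w2, b2, w3, b3):
    """Forward pass 1152 -> 100 -> 30 -> 1; returns all layer outputs."""
    o1 = layer(hog, w1, b1)          # first hidden layer, 100 nodes
    o2 = layer(o1, w2, b2)           # second hidden layer, 30 nodes
    out = layer(o2, [w3], [b3])[0]   # single output node
    return o1, o2, out


def random_matrix(rng, n_rows, n_cols, scale=0.05):
    return [[rng.uniform(-scale, scale) for _ in range(n_cols)]
            for _ in range(n_rows)]


rng = random.Random(0)
w1, b1 = random_matrix(rng, 100, 1152), [0.0] * 100
w2, b2 = random_matrix(rng, 30, 100), [0.0] * 30
w3, b3 = random_matrix(rng, 1, 30)[0], 0.0

# Hypothetical HOG feature vector with 1152 components.
hog_vector = [rng.random() for _ in range(1152)]
o1, o2, output = hognn_forward(hog_vector, w1, b1, w2, b2, w3, b3)
```

With untrained random weights the output has no meaning yet; the sketch only shows how the 1152 HOG components flow through the two hidden layers to the single output node.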
Let the activation function of the first hidden layer be the hyperbolic tangent sigmoid transfer function and then the output of its qth node O1hog