
Chapter 2 Related Works

2.4 Morphology Operations

2.4.2 Opening and Closing

As we have seen, dilation expands an image and erosion shrinks it. This section introduces two further important morphological operations built from dilation and erosion: opening and closing [33]. Generally, opening smooths the contour of an object, breaks narrow isthmuses, and eliminates thin protrusions.

Closing, in contrast to opening, fuses narrow breaks and long thin gulfs, eliminates small holes, and fills gaps in the contour, though it also tends to smooth sections of contours.

The opening operation is the dilation of the erosion of a set A by a structuring element B, defined as

A ∘ B = (A Θ B) ⊕ B    (2.21)

where Θ and ⊕ denote the erosion and dilation operations, respectively. For example, Fig-2.10 is a simple geometric interpretation of the opening operation. The region of A ∘ B is established by the points of A that can be reached by any point of B when B moves vertically or horizontally inside A. In fact, the opening operation can also be expressed as

A ∘ B = ∪{ (B)z | (B)z ⊆ A }    (2.22)

where ∪{·} denotes the union of all the sets inside the braces, and (B)z denotes the translation of B so that its origin is located at the point z.

Fig-2.10 Example of opening operation

Similarly, the closing operation is simply the erosion of the dilation of a set A by a structuring element B, defined as

A • B = (A ⊕ B) Θ B    (2.23)

where ⊕ and Θ denote the dilation and erosion operations, respectively. Fig-2.11 is a geometric example illustrating the closing operation. In Fig-2.11, B is moved along the outside of the boundary of A. It will be shown that opening and closing are duals of each other, so having to move the structuring element on the outside is not unexpected.

Fig-2.11 Example of closing operation
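For illustration, both operations can be composed from the erosion and dilation primitives, for example with OpenCV. The following is a minimal sketch; the input file name and the 5×5 rectangular structuring element are arbitrary choices, not fixed by the text.

```python
import cv2
import numpy as np

# Load a binary image (the path is illustrative only).
img = cv2.imread("binary_mask.png", cv2.IMREAD_GRAYSCALE)
_, A = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# 5x5 rectangular structuring element B.
B = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))

# Opening: erosion followed by dilation (Eq. 2.21).
opened = cv2.morphologyEx(A, cv2.MORPH_OPEN, B)

# Closing: dilation followed by erosion (Eq. 2.23).
closed = cv2.morphologyEx(A, cv2.MORPH_CLOSE, B)

# Equivalently, composed from the primitive operations:
opened_manual = cv2.dilate(cv2.erode(A, B), B)
closed_manual = cv2.erode(cv2.dilate(A, B), B)
assert np.array_equal(opened, opened_manual)
```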

Chapter 3

Human Detection Algorithm

The human detection system is implemented in four main steps, as shown in Fig-3.1: region-of-interest (ROI) selection, feature extraction, human shape recognition, and motionless human checking. The system obtains depth images and color images from the Kinect, and then selects regions of interest through histogram projection, connected component labeling (CCL), and moving objects segmentation. After ROI selection, the size of each ROI is normalized based on bilinear interpolation, no matter how large it is. Afterwards, the histogram of oriented gradients (HOG) is computed as the human feature descriptor. In the next step, the features are passed to the human shape recognition system to judge whether each ROI contains a human. Finally, the system checks whether any motionless human exists in the current frame and, if so, regards it as a new ROI in the next frame.

Fig-3.1 Flowchart of the human detection system

Fig-3.2(a) shows an example of the depth image generated by the Kinect, which contains 320×240 pixels with intensity values normalized into 0-255. The intensity value indicates the distance between an object and the camera; a lower intensity value implies a smaller distance. Besides, Fig-3.2(b) shows an example of the color image, which contains 320×240 pixels as well.


Fig-3.2 Example of the depth image and color image generated by Kinect

3.1 ROI Selection

In order to reduce the computational cost, foreground segmentation is required to filter out the background and obtain the region of interest (ROI), that is, a human moving in the image. The system detects moving objects from the temporal difference between consecutive frames and, based on that difference, applies histogram projection, connected component labeling (CCL), and moving object segmentation to select the ROIs, thereby increasing the speed and detection rate.

3.1.1 Histogram Projection

Based on the depth image of size 320×240, the system implements histogram projection with the following steps. First, the system computes the histogram of each column of the depth image, with intensity levels in the range [0, 255]. Let the histogram of the i-th column of the depth image be

hi = [c0,i c1,i c2,i … c255,i]^T, i = 1, 2, …, 320    (3.1)

where ck,i is the number of pixels with intensity k in the i-th column. Then, define the histogram image as

H = [h1 h2 h3 … h320]

with size 320×256. Note that the histogram image H can be considered the vertical distribution of objects or humans in the real world. However, there is a large number of pixels with intensity k = 0 due to a camera limitation: the camera cannot detect the depth of objects too near to it or of background too far from it.

Thus, the first row of H, c0 = [c0,1 c0,2 c0,3 … c0,320], contains large values of c0,i.

After obtaining the histogram image, it has to be further processed to filter out unnecessary information such as the first row c0 of H. Because the effective range of the camera is from 1.2 m to 3.6 m, the corresponding intensity range is [40, 240], and the value ck,i in H is adjusted as

ck,i = ck,i if 40 ≤ k ≤ 240, and ck,i = 0 otherwise.

Obviously, the values of H in the first 40 rows and the last 15 rows are all set to zero, which implies that objects too near to the camera and background too far from it are filtered out. After filtering, the histogram value ck,i can be treated as the vertical distribution of the objects at coordinate (i, k) in the real world.

Consequently, the histogram of depth intensities within a specified range, which may correspond to an object or a human, is connected and forms a clear shape in the histogram image.

Afterward, the histogram image can be treated as a top-view image of the 3-D distribution in (i, k) coordinates. If an object has a higher vertical distribution, it has a larger intensity in the top-view image. In order to eliminate non-human objects with lower intensity, the histogram image H is further processed to obtain the ROI image R as

R(i, k) = 1 if ck,i ≥ M, and R(i, k) = 0 otherwise,

where the threshold M corresponds to a suitable human height. Then, the morphological closing operation is applied to eliminate overly elongated objects and enhance the interior connection of each object. Hence, only objects or humans with a certain height remain in the ROI image R.
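As a concrete sketch of these steps, the following minimal NumPy illustration assumes the depth image is stored as a 240×320 array; the variable names and the value of the height threshold M are illustrative, not the thesis's actual code.

```python
import numpy as np

def histogram_projection(depth, m_threshold=50):
    """Build the 256x320 column-histogram image H and threshold it to R.

    depth: 320x240 depth image with intensities in [0, 255],
    stored as a 240x320 array (rows x columns).
    """
    H = np.zeros((256, 320), dtype=np.int32)
    for i in range(320):
        # Histogram of the i-th column: h_i = [c0,i ... c255,i]^T (Eq. 3.1)
        H[:, i] = np.bincount(depth[:, i], minlength=256)

    # Keep only the effective range [40, 240]; zero out rows
    # 0-39 (too near / undetected) and 241-255 (too far).
    H[:40, :] = 0
    H[241:, :] = 0

    # Threshold the vertical distribution to obtain the binary ROI image R.
    # (A morphological closing, as in Section 2.4.2, would follow here
    # in the full pipeline.)
    R = (H >= m_threshold).astype(np.uint8)
    return H, R
```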

3.1.2 Connected Component Labeling

Connected component labeling (CCL) [34] is a technique to identify distinct components, and it is often used in computer vision to detect 4- or 8-connected regions in binary digital images. This thesis applies 4-connectivity to label the ROIs in the image R.

The 4-pixel CCL algorithm can be partitioned into two processes, labeling and componentizing. The input is a binary image like Fig-3.4(a). During labeling, the image is scanned pixel by pixel, from left to right and top to bottom, as shown in Fig-3.3, where p is the pixel being processed, and r and t are the upper and left pixels, respectively.

Fig-3.3 The process of scanning the image

Define v(·) and l(·) as the binary value and the label of a pixel, respectively. N is a counter whose initial value is set to 1. If v(p) = 0, the scan moves on to the next pixel; otherwise, i.e., v(p) = 1, the label l(p) is determined by the following rules:

Rule 1: If v(r) = 0 and v(t) = 0, assign N to l(p) and then increase N by 1.

Rule 2: If v(r) = 1 and v(t) = 0, assign l(r) to l(p).

Rule 3: If v(r) = 0 and v(t) = 1, assign l(t) to l(p).

Rule 4: If v(r) = 1 and v(t) = 1, assign the smaller of l(r) and l(t) to l(p).

For example, Fig-3.4(a) is changed into Fig-3.4(b) after the process of labeling.

Clearly, some connected components contain pixels with different labels in Fig-3.4(b).

Hence, it is necessary to further execute the process of componentizing, which gathers all the pixels connected in one component and assigns them the same label, namely the smallest number among the labels in that component. Fig-3.4(c) is the result of Fig-3.4(b) after the process of componentizing.

(a) Binary image

(b) Labeling

(c) Componentizing

Fig-3.4 Example of 4-pixel CCL.
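The two passes can be sketched as follows. This is a minimal illustration assuming a NumPy binary image; the union-find bookkeeping that records label equivalences during Rule 4 is an implementation detail not spelled out in the text.

```python
import numpy as np

def ccl_4connected(binary):
    """Two-pass 4-connected component labeling of a binary image."""
    rows, cols = binary.shape
    labels = np.zeros((rows, cols), dtype=np.int32)
    parent = {}  # union-find forest: label -> parent label

    def find(a):
        # Follow parents to the representative (smallest) label.
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    n = 1  # the counter N, initialized to 1
    # Pass 1: labeling, scanning left to right and top to bottom.
    for y in range(rows):
        for x in range(cols):
            if binary[y, x] == 0:
                continue
            r = labels[y - 1, x] if y > 0 else 0  # upper pixel r
            t = labels[y, x - 1] if x > 0 else 0  # left pixel t
            if r == 0 and t == 0:        # Rule 1: start a new label
                labels[y, x] = n
                parent[n] = n
                n += 1
            elif t == 0:                 # Rule 2: copy the upper label
                labels[y, x] = r
            elif r == 0:                 # Rule 3: copy the left label
                labels[y, x] = t
            else:                        # Rule 4: keep the smaller label and
                ra, rb = find(r), find(t)  # record that r and t are equivalent
                root = min(ra, rb)
                parent[ra] = parent[rb] = root
                labels[y, x] = root

    # Pass 2: componentizing, replacing every label by the smallest
    # label recorded for its component.
    for y in range(rows):
        for x in range(cols):
            if labels[y, x] != 0:
                labels[y, x] = find(labels[y, x])
    return labels
```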


3.1.3 Moving Objects Segmentation

Moving objects segmentation is based on the temporal difference between consecutive frames of the image sequence. The system subtracts the previous frame Graypre from the current frame Graycur to acquire their difference, and then applies a threshold to create the corresponding binary image Graysub, defined as follows:

Graysub(x, y) = 1 if |Graycur(x, y) − Graypre(x, y)| ≥ T, and Graysub(x, y) = 0 otherwise,

where x and y denote the x-coordinate and y-coordinate, respectively, and T is the selected threshold.

Further, the binary image is processed by a morphological operation in order to suppress the influence of illumination changes, so that the moving objects are highlighted in the binary image. Taking Fig-3.5 as an example, Fig-3.5(a) is the previous frame and Fig-3.5(b) is the current frame. Fig-3.5(c) shows the difference between the previous frame and the current frame.


Fig-3.5 (a) Previous frame (b) Current frame (c) The difference of the frames

In this thesis, moving objects segmentation is implemented to narrow down the number of ROIs for a lower computational cost. More specifically, an ROI that includes moving pixels from the difference between the previous frame and the current frame is considered a new ROI; otherwise, the region is discarded. Thus, moving objects segmentation accelerates the system further.
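The segmentation step might be sketched as follows. The threshold value, the choice of an opening operation, and the 3×3 kernel are illustrative assumptions; the text does not fix which morphological operation is used.

```python
import cv2

def moving_object_mask(gray_pre, gray_cur, threshold=25):
    """Binary mask of moving pixels between two grayscale frames."""
    diff = cv2.absdiff(gray_cur, gray_pre)  # |Gray_cur - Gray_pre|
    _, gray_sub = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)

    # Morphological opening suppresses isolated noise pixels caused by
    # illumination change, so only true moving objects stay highlighted.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    gray_sub = cv2.morphologyEx(gray_sub, cv2.MORPH_OPEN, kernel)
    return gray_sub
```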

3.2 Feature Extraction

After ROI selection, the system must extract the necessary features from these ROIs in order to increase the detection rate and decrease the computational cost. The overall feature extraction is composed of two parts. First, the selected ROI is normalized to 64×128, no matter how large it is. Second, the histogram of oriented gradients is computed to extract the human information, which plays an important role in human detection.

3.2.1 Normalization

Because the selected ROIs usually differ in size, they are difficult to process consistently. Therefore, the selected ROIs have to be resized to an identical size to avoid degrading the detection rate and execution speed. The principle of image resizing (shrinking and zooming) is image interpolation, which is basically an image resampling method. Fundamentally, interpolation is the process of using known data to estimate values at unknown locations. As a simple example, suppose that an image of 500×500 pixels has to be enlarged 1.5 times to 750×750 pixels. A simple way to visualize zooming is to create an imaginary 750×750 grid with the same pixel spacing as the original, and then shrink it so that it fits exactly over the original image. Obviously, the pixel spacing in the shrunken 750×750 grid will be less than the pixel spacing in the original image. To assign an intensity level to any point in the overlay, an interpolation method looks for its closest pixel in the original image and assigns that pixel's intensity to the new pixel in the 750×750 grid. When intensities have been assigned to all the points in the overlay grid, the grid is expanded back to the specified size to obtain the zoomed image.

In fact, there are two common image interpolation approaches for resizing an image: nearest neighbor interpolation and bilinear interpolation. The nearest neighbor approach assigns to each new location the intensity of its nearest neighbor in the original image. This approach is simple, but it has a tendency to produce undesirable artifacts, such as severe distortion of straight edges. For this reason, it is used infrequently in practice. The bilinear interpolation approach is more suitable; it uses the four nearest neighbors to estimate the intensity at a given location. Let (x, y) denote the coordinates of the location to which an intensity value is to be assigned, and let υ(x, y) denote that intensity value. For bilinear interpolation, the assigned value is obtained using the equation

υ(x, y) = ax + by + cxy + d    (3.5)

where the four coefficients are determined from the four equations in four unknowns that can be written using the four nearest neighbors of point (x, y). As Fig-3.6 shows, the bilinear interpolation approach gives much better results than the nearest neighbor approach, with only a modest increase in computational burden.


Fig-3.6 (a) The result of the nearest neighbor interpolation approach (b) The result of the bilinear interpolation approach

In this thesis, the normalization method adopts the bilinear interpolation approach to resize the selected ROIs to the same size. Hence, the normalized ROIs can be processed more simply, and their consistency increases the detection rate and execution speed.
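A direct sketch of this resampling follows; it is a minimal implementation of Eq. (3.5) through the four nearest neighbors, and in practice a library routine such as OpenCV's cv2.resize with INTER_LINEAR performs the same computation.

```python
import numpy as np

def resize_bilinear(src, out_w=64, out_h=128):
    """Resize a grayscale ROI to out_w x out_h with bilinear interpolation."""
    in_h, in_w = src.shape
    dst = np.zeros((out_h, out_w), dtype=src.dtype)
    for yo in range(out_h):
        for xo in range(out_w):
            # Map the output pixel back onto the original grid.
            x = xo * (in_w - 1) / (out_w - 1)
            y = yo * (in_h - 1) / (out_h - 1)
            x0, y0 = int(x), int(y)
            x1 = min(x0 + 1, in_w - 1)
            y1 = min(y0 + 1, in_h - 1)
            dx, dy = x - x0, y - y0
            # Weighted sum of the four nearest neighbors (Eq. 3.5).
            v = (src[y0, x0] * (1 - dx) * (1 - dy) +
                 src[y0, x1] * dx * (1 - dy) +
                 src[y1, x0] * (1 - dx) * dy +
                 src[y1, x1] * dx * dy)
            dst[yo, xo] = v
    return dst
```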

3.2.2 Histogram of Oriented Gradient

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for object detection. The technique counts occurrences of gradient orientations in localized portions of an image. The concept of HOG is similar to edge orientation histograms [35], scale-invariant feature transform descriptors, and shape contexts, but differs in that HOG is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

Fig-3.7 The parameters of HOG for computation

In general, HOG is used to detect objects of interest by scanning the whole image at multiple scales to find objects of different sizes. In this thesis, however, HOG is applied only to the normalized ROIs, so the system reduces the computational time and achieves a higher detection speed. The HOG method presented in this thesis was proposed by Dalal and Triggs [16], and Fig-3.7 shows the parameters of HOG used for computation. The detection window of HOG has size 64×128, the same as the normalized ROIs. Each block of size 16×16 moves 8 pixels horizontally and vertically, and each cell of size 8×8 is divided into 9 orientation bins. Because a HOG descriptor represents a detection window, a detection window contains 105 blocks by equation (3.6), and the vector length of the HOG descriptor is 3780 dimensions by equation (3.7).

((128 − 16)/8 + 1) × ((64 − 16)/8 + 1) = 15 × 7 = 105    (3.6)

1 block = 4 cells = 4 × 9 bins

∴ the vector length of the HOG descriptor = 105 × 4 × 9 = 3780    (3.7)

Then, HOG is computed by the following steps:

Step-1: Compute both the magnitude and the orientation of the gradient, defined as

m(x, y) = √( Gx(x, y)² + Gy(x, y)² )

θ(x, y) = tan⁻¹( Gy(x, y) / Gx(x, y) )

where Gx and Gy are the gradients in the horizontal and vertical directions obtained by convolving the image, and the convolution operations are shown in Fig-3.8(a) and Fig-3.8(b), respectively.

Fig-3.8 (a) The horizontal operation (b) The vertical operation

Step-2: The gradient orientations of a cell are evenly divided into 9 bins over 0° to 180°. The sign of the orientation is ignored, so orientations between 180° and 360° are treated the same as those between 0° and 180°. Then, the gradient orientation histogram Ek(i, j) in each orientation bin k of block B(i, j) is obtained by summing all the gradient magnitudes whose orientations belong to bin k in B(i, j):

Ek(i, j) = Σ m(x, y), summed over the pixels (x, y) in B(i, j) whose orientation θ(x, y) falls in bin k.

An example of HOG is shown in Fig-3.9.

Fig-3.9 An example of HOG

After the HOG descriptors are extracted and computed, they are used as training patterns of the human recognition system to determine whether any human is in the image.
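With the parameters of Fig-3.7, the descriptor can be reproduced, for example, with scikit-image. This is a sketch under the assumption that the normalized ROI is a 128×64 grayscale array; the thesis's own implementation may differ in normalization details.

```python
import numpy as np
from skimage.feature import hog

# Normalized ROI: 128 rows x 64 columns, grayscale (placeholder input).
roi = np.random.randint(0, 256, (128, 64)).astype(np.float32)

features = hog(
    roi,
    orientations=9,          # 9 unsigned orientation bins over 0-180 degrees
    pixels_per_cell=(8, 8),  # 8x8 cells
    cells_per_block=(2, 2),  # 16x16 blocks = 4 cells, stride 8 pixels
    block_norm="L2-Hys",
)
# ((128-16)/8 + 1) * ((64-16)/8 + 1) = 15 * 7 = 105 blocks,
# each 4 cells x 9 bins -> 105 * 4 * 9 = 3780 dimensions (Eqs. 3.6-3.7).
assert features.shape == (3780,)
```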

3.3 Human Shape Recognition

In this section, the system judges whether an ROI contains a human based on the extracted features. For a higher detection rate and tolerance to variation, this thesis adopts the Leeds Sports Pose (LSP) dataset [36], which contains 2000 pose images of mostly sports people gathered from Flickr. Some examples of the original LSP dataset are shown in Fig-3.10. Human shape recognition based on the LSP dataset can be divided into pre-processing and shape recognition. First, pre-processing includes data normalization and classifier establishment. Then, shape recognition is implemented through the classifier.

Fig-3.10 Some examples of the original dataset

In pre-processing, the LSP dataset is first converted to grayscale images and then normalized to the size 64×128. Fig-3.11 shows some examples of the normalized dataset. The examples in Fig-3.11 have the same size regardless of their original size, so an example may have been shrunk or zoomed.

Fig-3.11 Some examples of the normalized dataset

After every image is normalized, the features of the normalized dataset are extracted as HOG descriptors, and the feature vector length is 3780. In other words, an input of the recognition system has 3780 dimensions. After the features of the dataset are extracted, the recognition approach considers these features as positive training data and establishes the classifier based on the class of the training data. In this thesis, there are two approaches: the support vector machine (SVM) and the artificial neural network (ANN). Because both approaches are supervised learning, training data of humans and non-humans are required. Therefore, the HOG features of the LSP dataset are considered positive training data, and the system randomly collects 1328 non-human images as the negative dataset. These two approaches are introduced below, and their performances are compared in Chapter 4.

• Approach 1: Support vector machine

In this section, the training data can be described simply by Fig-3.12, where the rectangles represent negative training data of the SVM and the circles represent positive training data. Each symbol represents a feature vector, which can be described as

vecfeature = [bin1 bin2 … bin3780]^T    (3.11)

where bink is the k-th element of the computed HOG descriptor.

Fig-3.12 The distribution of the SVM training data

After the HOG descriptors are computed, equations (2.13) to (2.18) can be used to find the maximum margin in Fig-3.12 and the optimal hyperplane. The optimal hyperplane serves as the classifier for human shape recognition by SVM. The established classifier is shown in Fig-3.13.

Fig-3.13 The classifier for human shape recognition by SVM
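Training such a classifier can be sketched with scikit-learn. The file names are hypothetical and the linear kernel is an assumption; equations (2.13) to (2.18) describe the underlying margin maximization.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Assumed pre-computed HOG features: 2000 positive (LSP) and
# 1328 negative samples, each a 3780-dimensional vector.
pos = np.load("lsp_hog_features.npy")       # shape (2000, 3780), hypothetical
neg = np.load("nonhuman_hog_features.npy")  # shape (1328, 3780), hypothetical

X = np.vstack([pos, neg])
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

# Maximum-margin linear classifier; the learned hyperplane
# separates human from non-human feature vectors.
clf = LinearSVC(C=1.0)
clf.fit(X, y)

def is_human(hog_vec):
    """Classify one 3780-dimensional HOG descriptor."""
    return clf.predict(hog_vec.reshape(1, -1))[0] == 1
```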

• Approach 2: Artificial neural network

The second approach uses a neural network to classify the whole dataset, including the positive and negative training data. The concept of the ANN is similar to that of the SVM, since the ANN also adopts a dichotomy to train the classifier. The weights of the neural network are adjusted through the learning process introduced in Section 2.2. After learning, a human can be recognized according to the output value of the neural network, which is introduced in detail below.

The structure of the neural network is shown in Fig-3.14; it contains an input layer with 3780 neurons, a first hidden layer with 420 neurons, a second hidden layer with 30 neurons, and an output layer with one neuron. The values of the feature vector are sent into the neural network as inputs. The 3780 neurons of the input layer are connected to the first hidden layer with weights W¹SI(p, q), and the first hidden layer is connected to the second hidden layer with weights W²SI(q, r). Therefore, there exist a weight array W¹SI(p, q) of dimension 3780×420 and a weight array W²SI(q, r) of dimension 420×30. Besides, the q-th neuron of the first hidden layer has an extra bias b¹SI(q), and the r-th neuron of the second hidden layer has an extra bias b²SI(r). Finally, the r-th neuron of the second hidden layer is connected to the output neuron with weight W³SI(r), r = 1, 2, …, 30, and a bias b³SI is added to the output neuron.

Fig-3.14 The structure of the neural network

Let the activation function of the first hidden layer be the log-sigmoid transfer function, so the output of the q-th neuron is

a¹SI(q) = logsig( Σp=1..3780 W¹SI(p, q) x(p) + b¹SI(q) ), q = 1, 2, …, 420,

where x(p) is the p-th element of the input feature vector. Let the activation function of the second hidden layer also be the log-sigmoid transfer function, so the output of the r-th neuron is

a²SI(r) = logsig( Σq=1..420 W²SI(q, r) a¹SI(q) + b²SI(r) ), r = 1, 2, …, 30.

Finally, let the activation function of the output layer be the linear transfer function, so the output is expressed as

y = Σr=1..30 W³SI(r) a²SI(r) + b³SI.

The process of the above operations is shown in Fig-3.15. The output of the ANN therefore approaches 0 or 1, and it serves as the classification result for human shape recognition.

Fig-3.15 The process of the ANN
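A forward pass of the 3780-420-30-1 network can be sketched in NumPy as follows. The weight values are placeholders standing in for the learned parameters; the log-sigmoid hidden activations and linear output follow the description above.

```python
import numpy as np

def logsig(x):
    """Log-sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-x))

# Learned parameters (shapes follow the text; values are placeholders).
W1 = np.random.randn(3780, 420) * 0.01  # input -> first hidden layer
b1 = np.zeros(420)
W2 = np.random.randn(420, 30) * 0.01    # first -> second hidden layer
b2 = np.zeros(30)
W3 = np.random.randn(30)                # second hidden layer -> output
b3 = 0.0

def ann_forward(x):
    """x: 3780-dim HOG feature vector -> scalar output near 0 or 1."""
    a1 = logsig(x @ W1 + b1)   # 420 first-hidden-layer outputs
    a2 = logsig(a1 @ W2 + b2)  # 30 second-hidden-layer outputs
    return a2 @ W3 + b3        # linear output neuron

# An ROI is judged human when the output is closer to 1 than to 0.
is_human = ann_forward(np.random.rand(3780)) >= 0.5
```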

As for shape recognition, the feature vector of a test image is sent into the SVM or ANN classifier, and the feature vector is classified as either human or non-human. As a result, the system can judge whether an ROI includes a human.

3.4 Motionless Human Checking

In Section 3.1.3, moving objects segmentation is implemented to narrow down the number of ROIs for a lower computational cost. However, an ROI may have been judged to contain a human in the previous frame while the same ROI shows no motion in the current frame. In other words, such an ROI should still contain the human, yet the system would filter it out in the current frame. Therefore, the system has to check whether an ROI is in this situation and contains a human.

Fig-3.16 The process of storing the ROI parameters

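One way to realize this check is simple bookkeeping: carry the human-labeled ROIs of the previous frame forward as candidate ROIs of the current frame. The following is a sketch only; the ROI representation and function names are illustrative, not the thesis's actual code.

```python
# ROIs confirmed to contain a human, stored as (x, y, w, h) rectangles.
previous_human_rois = []

def select_rois(moving_rois):
    """Candidate ROIs = moving regions + human ROIs kept from the last frame.

    A human who stops moving produces no temporal difference, so the
    previously confirmed ROI is re-checked instead of being filtered out.
    """
    return list(moving_rois) + previous_human_rois

def update_human_rois(confirmed_rois):
    """Store this frame's human ROIs for the next frame's check."""
    global previous_human_rois
    previous_human_rois = list(confirmed_rois)
```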
