
Chapter 3 Intelligent Human Detection


3.3.2 Shape Recognition

After chamfer matching, the system has to judge whether the ROIs contain a human or not, based on the coordinates of the matched regions of different parts of the body.

Because of its tolerance to variation, chamfer matching has a high true positive rate and correctly detects most real body parts, but it also has an unwantedly high false positive rate, misjudging other objects as body parts. To reduce the false positive rate, the concept of shape recognition is used in the following process. Since the relations between different parts of the body are fixed, these parts can be combined based on their geometric relation. For example, if a head can be combined with a torso, it is reasonable to infer that the possibility of containing a human increases.

Fig-3.18 Two different template sets. (a) Set-I: FH, FT, FL. (b) Set-II: LH, RH, LT, RT, LL, RL.

On the contrary, if a head cannot be combined with other adjacent body parts, it might be another object rather than a human head. In this thesis, there are two recognition approaches: a voting-based approach and a neural-network-based approach. These two approaches are introduced below, and their performances are compared in Chapter 4.

Approach 1: Voting-based recognition

In this section, Set-II is used as an example to introduce how the voting-based approach works; the scheme of voting-based recognition is shown in Fig-3.19. In order to deal with occlusion problems, a whole human shape is separated into four groups: the left-, right-, upper- and lower-group. If a part of the body can be combined with an adjacent part, the ROI is more likely to contain a human. Take the left-head as an example: if the left-head can be combined with the left-torso, the left-group and the upper-group each get one vote. Similarly, if the left-head can be combined with the right-head, the upper-group gets one more vote. All the relations between two adjacent body parts, e.g. left-head to right-head, right-torso to right-leg, etc., are checked. If a relation is reasonable, the corresponding body group gets one more vote. After finding the votes of the four groups, the occlusion judgment introduced in Section 3.1.2 is also added as a feature and is used to adjust the proportions of the four groups. For instance, if the occlusion judgment is left-occlusion, the proportion of the right-group is increased and the proportion of the left-group is decreased. Similarly, if the occlusion judgment is frontal-occlusion, the proportions of the upper- and lower-group are enhanced and the proportions of the left- and right-group are reduced. After adjusting the proportions, the system sums up these four votes to obtain the final vote. If the final vote exceeds a threshold, e.g. half of the total votes, the ROI is regarded as human, and vice versa. Note that the process of voting-based recognition with Set-I is similar to the process introduced above. The concept of the voting-based approach is straightforward and easy to implement. However, the relations between adjacent body parts and the threshold have to be determined manually.
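The voting step described above can be sketched as follows. The part names, pairwise relations, distance tolerances and occlusion weights here are illustrative assumptions, not the values used in the thesis.

```python
# Sketch of voting-based recognition (Set-II). All relations, tolerances
# and occlusion weights below are illustrative assumptions.

# Pairwise relations: (part_a, part_b, groups that gain a vote when the
# two parts can be combined).
RELATIONS = [
    ("LH", "RH", ["upper"]),
    ("LH", "LT", ["left", "upper"]),
    ("RH", "RT", ["right", "upper"]),
    ("LT", "LL", ["left", "lower"]),
    ("RT", "RL", ["right", "lower"]),
    ("LT", "RT", ["upper"]),
    ("LL", "RL", ["lower"]),
]

def relation_ok(pa, pb, max_dx=40, max_dy=80):
    """Illustrative geometric check: two matched parts are 'combinable'
    when their coordinates are close enough (tolerances are assumptions)."""
    return abs(pa[0] - pb[0]) <= max_dx and abs(pa[1] - pb[1]) <= max_dy

def vote_human(parts, occlusion=None, threshold=0.5):
    """parts: dict part-name -> matched (x, y) from chamfer matching.
    occlusion: None, 'left', 'right' or 'frontal' (Section 3.1.2 judgment).
    Returns True when the weighted vote fraction exceeds the threshold."""
    votes = {"left": 0.0, "right": 0.0, "upper": 0.0, "lower": 0.0}
    possible = {"left": 0.0, "right": 0.0, "upper": 0.0, "lower": 0.0}
    for a, b, groups in RELATIONS:
        for g in groups:
            possible[g] += 1
        if a in parts and b in parts and relation_ok(parts[a], parts[b]):
            for g in groups:
                votes[g] += 1
    # Occlusion judgment re-weights the four groups (weights are assumptions).
    weights = {"left": 1.0, "right": 1.0, "upper": 1.0, "lower": 1.0}
    if occlusion == "left":
        weights["left"], weights["right"] = 0.5, 1.5
    elif occlusion == "right":
        weights["left"], weights["right"] = 1.5, 0.5
    elif occlusion == "frontal":
        weights["upper"] = weights["lower"] = 1.5
        weights["left"] = weights["right"] = 0.5
    final = sum(weights[g] * votes[g] for g in votes)
    total = sum(weights[g] * possible[g] for g in possible)
    return final / total > threshold
```

For a fully visible human, all relations are satisfied and the vote fraction reaches 1, so the ROI is accepted; an ROI with no combinable parts is rejected.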

Approach 2: Neural-network-based recognition

The second approach uses a neural network to combine different parts of the body. The concept of neural-network-based recognition is similar to that of voting-based recognition: if a part of the body can be combined with an adjacent part, the possibility that the ROI contains a human increases.

Fig-3.19 Scheme of voting-based recognition

In supervised learning, training data of humans and non-humans are required. Therefore, images of humans with different poses and images of other objects are collected and processed through the steps in Sections 3.1 and 3.2. After chamfer matching, the matched coordinates of the different template images are recorded as training data. In this thesis, there are 1500 training samples in total, including 500 positive and 1000 negative samples. The weights of the neural network are adjusted through the learning process introduced in Section 2.3. After learning, a human can be recognized according to the output value of the neural network. Two neural networks are proposed in this thesis, named the Set-I and Set-II neural networks, which are introduced in detail below.

The structure of the Set-I neural network is shown in Fig-3.20; it contains one input layer with 6 neurons, one hidden layer with 12 neurons, and one output layer with 1 neuron. After chamfer matching, the coordinates of the matched regions of FH, FT and FL are recorded as (𝑥FH, 𝑦FH), (𝑥FT, 𝑦FT) and (𝑥FL, 𝑦FL). The differences between these three coordinates in the x- and y-coordinates are computed and fed into the neural network as inputs. The 6 neurons of the input layer are represented by 𝑆𝐼(𝑝), p=1,2,…,6. The p-th input neuron is connected to the q-th neuron, q=1,2,…,12, of the hidden layer with weight 𝑊𝑆1𝐼(𝑝, 𝑞); hence there exists a weight array 𝑊𝑆1𝐼(𝑝, 𝑞) of dimension 6×12. Besides, the q-th neuron of the hidden layer also carries an extra bias 𝑏𝑆1𝐼(𝑞). Finally, the q-th neuron of the hidden layer is connected to the output neuron with weight 𝑊𝑆2𝐼(𝑞), q=1,2,…,12, and a bias 𝑏𝑆2𝐼 is added to the output neuron.

Let the activation function of the hidden layer be the log-sigmoid transfer function; the output of the q-th hidden neuron is then

𝑎𝑆1𝐼(𝑞) = 1 / (1 + exp(−𝑛𝑆1𝐼(𝑞))), where 𝑛𝑆1𝐼(𝑞) = Σ_{p=1}^{6} 𝑊𝑆1𝐼(𝑝, 𝑞) 𝑆𝐼(𝑝) + 𝑏𝑆1𝐼(𝑞).

Let the activation function of the output layer be the linear transfer function; the output is expressed as

𝑦𝑆𝐼 = Σ_{q=1}^{12} 𝑊𝑆2𝐼(𝑞) 𝑎𝑆1𝐼(𝑞) + 𝑏𝑆2𝐼.

The above operations are shown in Fig-3.21.
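For illustration, a minimal forward pass of a 6-12-1 network of this shape can be written as below. The weights are random stand-ins for the learned ones, and the exact pairing of the six coordinate differences is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weights; in the thesis these come from supervised
# learning (Section 2.3), not random initialization.
W_S1I = rng.normal(size=(6, 12))   # input -> hidden weights (6x12)
b_S1I = rng.normal(size=12)        # hidden biases
W_S2I = rng.normal(size=12)        # hidden -> output weights
b_S2I = 0.1                        # output bias

def logsig(n):
    """Log-sigmoid transfer function: 1 / (1 + e^{-n})."""
    return 1.0 / (1.0 + np.exp(-n))

def set1_forward(coords):
    """coords: dict with matched (x, y) for 'FH', 'FT', 'FL'.
    The six inputs are pairwise x- and y-differences (assumed pairing)."""
    (xh, yh), (xt, yt), (xl, yl) = coords["FH"], coords["FT"], coords["FL"]
    S_I = np.array([xh - xt, yh - yt, xt - xl, yt - yl, xh - xl, yh - yl],
                   dtype=float)
    a = logsig(S_I @ W_S1I + b_S1I)   # hidden layer, log-sigmoid
    return float(a @ W_S2I + b_S2I)   # output layer, linear
```

The scalar output is then compared against a learned decision value to classify the ROI.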

Fig-3.20 Structure of Set-I neural network (6 input neurons, 12 hidden neurons, 1 output neuron)

For the Set-II neural network, there are one input layer with 18 neurons, one hidden layer with 30 neurons, and one output layer with 4 neurons, as shown in Fig-3.22.

Similar to the Set-I neural network, the inputs of the Set-II neural network are the differences between the coordinates of different body parts in the x- and y-coordinates. The 18 neurons of the input layer are represented by 𝑆𝐼𝐼(𝑝), p=1,2,…,18. The p-th input neuron is connected to the q-th neuron, q=1,2,…,30, of the hidden layer with weight 𝑊𝑆1𝐼𝐼(𝑝, 𝑞); hence there exists a weight array 𝑊𝑆1𝐼𝐼(𝑝, 𝑞) of dimension 18×30. Besides, the q-th neuron of the hidden layer also carries an extra bias 𝑏𝑆1𝐼𝐼(𝑞).

Finally, the q-th neuron of the hidden layer is connected to the r-th neuron, r=1,2,3,4, of the output layer with weight 𝑊𝑆2𝐼𝐼(𝑞, 𝑟), and a bias 𝑏𝑆2𝐼𝐼(𝑟) is added to the output neurons. These four output neurons represent the performances of the left-, right-, upper- and lower-group, respectively. Similar to voting-based recognition, the occlusion judgment is added to adjust the proportions of the four groups, and these performances are then summed up into a final performance. If the final performance exceeds a threshold, the ROI is regarded as human, and vice versa.

Fig-3.21 Set-I neural network

Let the activation function of the hidden layer be the log-sigmoid transfer function; the output of the q-th hidden neuron is then

𝑎𝑆1𝐼𝐼(𝑞) = 1 / (1 + exp(−𝑛𝑆1𝐼𝐼(𝑞))), where 𝑛𝑆1𝐼𝐼(𝑞) = Σ_{p=1}^{18} 𝑊𝑆1𝐼𝐼(𝑝, 𝑞) 𝑆𝐼𝐼(𝑝) + 𝑏𝑆1𝐼𝐼(𝑞).

Let the activation function of the output layer be the linear transfer function; the output of the r-th output neuron is

𝑦𝑆𝐼𝐼(𝑟) = Σ_{q=1}^{30} 𝑊𝑆2𝐼𝐼(𝑞, 𝑟) 𝑎𝑆1𝐼𝐼(𝑞) + 𝑏𝑆2𝐼𝐼(𝑟).

The above operations are shown in Fig-3.23.
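The final decision step, combining the four group outputs with an occlusion-dependent re-weighting, can be sketched as follows; the weight values and threshold are illustrative assumptions, not the thesis values.

```python
import numpy as np

def set2_decision(group_scores, occlusion=None, threshold=1.0):
    """group_scores: length-4 sequence of Set-II outputs for the left-,
    right-, upper- and lower-group (in that assumed order).
    occlusion: None, 'left', 'right' or 'frontal' (Section 3.1.2 judgment).
    Returns True when the re-weighted sum exceeds the threshold."""
    w = np.ones(4)  # order: left, right, upper, lower
    if occlusion == "left":
        w[0], w[1] = 0.5, 1.5   # de-emphasize occluded left side
    elif occlusion == "right":
        w[0], w[1] = 1.5, 0.5
    elif occlusion == "frontal":
        w[:] = [0.5, 0.5, 1.5, 1.5]  # emphasize upper/lower groups
    return float(w @ np.asarray(group_scores, dtype=float)) > threshold
```

Note how a left-occluded ROI can still be accepted when the right-group output is strong, which is the point of the group decomposition.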

Fig-3.22 Structure of Set-II neural network with occlusion judgment

Fig-3.23 Set-II neural network

Chapter 4

Experimental Results

In the previous chapters, the three main steps of the proposed human detection system were introduced. In this chapter, the experimental results of each step are shown in detail; the results of the proposed algorithm are obtained using MATLAB R2010b and OpenCV 2.2.

4.1 ROI Selection

In order to examine the reliability of ROI selection, the system is tested in many different situations, including different poses, occlusion by other objects, more than one human, and complex backgrounds. The results are shown in Fig-4.1 to Fig-4.4, and all four figures have three columns: the left column contains the original depth images, the middle one shows the ROI images after CCL, and the right one presents the results of ROI selection. Note that the red rectangles in the middle and right columns are the selected ROIs. These regions are extracted and further processed in the following steps. The human in Fig-4.1 takes different poses, including walking, waving hands, etc. As long as the human keeps standing or walking, the system does not fail to extract the human region. In Fig-4.2, there are one human and one chair in the images, and they might occlude each other, but the system can still detect the human region and separate the human and the chair as two distinct objects.

Fig-4.1 Results of ROI selection in the condition of different poses. (a) The original depth images (b) The ROI images after CCL (c) The results of ROI selection.

Note that the rectangles in (b) and (c) are the selected ROIs.

(a) (b) (c)

The situations in Fig-4.3 and Fig-4.4 are more complex. In Fig-4.3, more than one human stands in front of the camera, and they might stand side by side or occlude each other. The system can still extract the human regions and separate them as distinct objects even under serious occlusion. In Fig-4.4, ROI selection is tested against a complex background, and there are many small dot-like regions in the ROI images. Through CCL, the system filters out these regions to reduce the number of ROIs and still succeeds in extracting the human regions.

Fig-4.2 Results of ROI selection in the condition of one human and one chair

(a) (b) (c)

Fig-4.3 Results of ROI selection in the condition of more than one human

(a) (b) (c)

Fig-4.4 Results of ROI selection in complex background

(a) (b) (c)

4.2 Feature Extraction

In this section, the experimental results are presented in two parts. The first part focuses on the performance of normalization. The second part shows the results of edge detection and distance transformation.

4.2.1 Normalization

The objective of normalization is to reduce the influence of distance, because the same object at different distances has different sizes in the image. In order to examine the function of normalization, a human of 170 cm height stands at different distances, as shown in Fig-4.5(a), where the distances between the human and the camera are 1.6 m, 2.0 m, 2.4 m, 2.8 m, 3.2 m and 3.6 m from top to bottom. Fig-4.5(b) shows the result of ROI selection, and the human regions are then extracted as shown in Fig-4.5(c), where the same human at different distances has different sizes. The standard distance is set to 2.4 m. Based on (3.11), all the human regions in Fig-4.5(c) are normalized and resized to a similar size. The result of normalization is shown in Fig-4.5(d). To compare the results more clearly, the images in Fig-4.5(d) are lined up in a row in Fig-4.6. Obviously, the influence of distance is greatly reduced. Note that all selected ROIs, not only the human regions, are normalized; this section just uses the human as an example.
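A minimal sketch of such distance-based normalization is given below, using nearest-neighbour resampling with scale = distance / standard distance; the exact form of (3.11) is not reproduced here, so this is an assumption about its effect.

```python
import numpy as np

def normalize_roi(roi, distance_m, standard_m=2.4):
    """Resize an ROI so that objects at different distances end up with
    a similar size. An object farther than the standard distance appears
    smaller, so its ROI is scaled up by distance / standard distance.
    Nearest-neighbour resampling sketch; assumed stand-in for (3.11)."""
    scale = distance_m / standard_m
    h, w = roi.shape[:2]
    nh = max(1, int(round(h * scale)))
    nw = max(1, int(round(w * scale)))
    # Map each target pixel back to its nearest source pixel.
    ys = np.minimum((np.arange(nh) / scale).astype(int), h - 1)
    xs = np.minimum((np.arange(nw) / scale).astype(int), w - 1)
    return roi[np.ix_(ys, xs)]
```

With the standard distance 2.4 m, an ROI captured at 3.6 m is enlarged by a factor of 1.5, while one captured at 2.4 m is left unchanged.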

Fig-4.5 The same human standing in 1.6m, 2.0m, 2.4m, 2.8m, 3.2m and 3.6m from top to bottom. (a) Original depth images (b) The results of ROI selection (c) The extracted human regions. (d) The results of normalization

(a) (b) (c) (d)

4.2.2 Edge Detection and Distance Transformation

The results of edge detection and distance transformation are shown in Fig-4.7 to Fig-4.10. Taking Fig-4.7 as an example, Fig-4.7(a) is the outcome of ROI selection, and the selected ROIs are then separated as shown in Fig-4.7(b). Next, the ROIs are normalized and resized based on the distance between the object and the camera, as presented in Fig-4.7(c). Finally, edge detection and distance transformation are applied, and the results are shown in Fig-4.7(d) and Fig-4.7(e), respectively. Fig-4.8, Fig-4.9 and Fig-4.10 are presented in the same way.
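The two feature-extraction operations can be illustrated with a small NumPy sketch. The gradient-threshold edge detector and brute-force distance transform below are simplified stand-ins for the detectors used in the thesis (in practice, e.g., cv2.Canny and cv2.distanceTransform from OpenCV would be used).

```python
import numpy as np

def edge_map(depth, thresh=10):
    """Mark a pixel as an edge when the depth gradient magnitude exceeds
    a threshold (simplified stand-in for the thesis' edge detector)."""
    d = depth.astype(float)
    gx = np.abs(np.diff(d, axis=1, prepend=d[:, :1]))
    gy = np.abs(np.diff(d, axis=0, prepend=d[:1, :]))
    return (gx + gy) > thresh

def distance_transform(edges):
    """Euclidean distance from every pixel to the nearest edge pixel.
    Brute force for clarity; real implementations use faster algorithms."""
    ey, ex = np.nonzero(edges)
    if len(ey) == 0:
        return np.full(edges.shape, np.inf)
    yy, xx = np.mgrid[:edges.shape[0], :edges.shape[1]]
    d2 = (yy[..., None] - ey) ** 2 + (xx[..., None] - ex) ** 2
    return np.sqrt(d2.min(axis=-1))
```

The resulting distance image is exactly what chamfer matching consumes: a template placed on it scores well where its edge points lie near image edges.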

Fig-4.6 Comparison of the result of normalization. The human is originally standing at 1.6m, 2.0m, 2.4m, 2.8m, 3.2m and 3.6m from left to right.

(a) (b) (c) (d) (e)

Fig-4.7 Result of edge detection and distance transformation in the condition of walking pose

Fig-4.8 Result of edge detection and distance transformation in the condition of more than one human

(a) (b) (c) (d) (e)

(a) (b) (c) (d) (e)

Fig-4.9 Result of edge detection and distance transformation in complex background

4.3 Human Recognition

(a) (b) (c) (d) (e)

Fig-4.10 Result of edge detection and distance transformation in the condition of one human and one chair

In this section, the human recognition system is tested in different situations to examine its performance and reliability. Before presenting the results, the method for evaluating them needs to be introduced. In general, the major objective of a detection system is to detect humans in an image or a sequence of images. In human detection, there are four possible events, given in Table 4.1: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). These four events are determined based on the actual condition and the test result, and are listed below:

1. True Positive, TP, means a real human is detected as human.

2. True Negative, TN, means a non-human is detected as non-human.

3. False Positive, FP, means a non-human is detected as human.

4. False Negative, FN, means a real human is detected as non-human.

With these four events, the true positive rate TPR and false positive rate FPR can be respectively defined as

TPR = TP / (TP + FN) × 100% (4.1)

FPR = FP / (FP + TN) × 100% (4.2)

A true positive rate of 100% means all humans are detected correctly, while a false positive rate of 0% means no non-human is detected as human. To compare the performance of the system, the accuracy rate AR is defined as

AR = (TP + TN) / (TP + TN + FP + FN) × 100% (4.3)

and a higher AR implies a better detection performance.
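These rates can be computed directly from the four event counts:

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (TPR, FPR, AR) as percentages from the four event counts:
    TPR = TP/(TP+FN), FPR = FP/(FP+TN), AR = (TP+TN)/(TP+TN+FP+FN)."""
    tpr = 100.0 * tp / (tp + fn)
    fpr = 100.0 * fp / (fp + tn)
    ar = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return tpr, fpr, ar
```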

Table 4.1 TP, FP, FN, TN table

In order to examine the robustness of the human recognition system, many test images in different situations are collected. The possible situations can be roughly separated into three cases: different poses (DP), occlusion by other objects or humans (OC), and complex background (CB). In this thesis, the overall test image set, which contains 2714 test images, is separated into three groups: 980 images in the DP group, 1114 images in the OC group and 620 images in the CB group, as shown in Fig-4.11 to Fig-4.13. Separating them in this way makes it simple to observe and compare the reliability of the system in these situations.

Fig-4.11 Examples of test images in DP group

Fig-4.12 Examples of test images in OC group

After ROI selection and feature extraction, the selected ROIs and extracted features are sent into the human recognition system. Note that there are 9173 selected ROIs in the test image set in total: 1972 in the DP group, 3842 in the OC group and 3359 in the CB group. In this thesis, there are two template sets, Set-I and Set-II, and two recognition approaches, the voting-based and the neural-network-based approach. Hence, there are four different methods: Set-I-Voting, Set-I-NN, Set-II-Voting and Set-II-NN. The performances of these methods in the different test groups are shown in Table 4.2, including TPR, FPR and AR.

Moreover, Table 4.3 shows the TPR, FPR and AR over the overall test image set, together with the average executing time. From these two tables, several conclusions can be drawn:

 It is obvious that the accuracy rate of Set-II is higher than that of Set-I, especially in the OC group. Under slight occlusion, Set-I and Set-II both perform well; under serious occlusion, however, the accuracy rate of Set-I drops noticeably. On the other hand, the computational cost of Set-I is lower than that of Set-II, and the average executing time of Set-I is below 0.1 s.

Fig-4.13 Examples of test images in CB group

 The performance of the neural-network-based approach is better than that of the voting-based approach. The concept of the voting-based approach is straightforward and easy to implement, but it cannot handle all poses and situations because the definitions of the relations between different body parts are coarse. The neural network, in contrast, can adjust its weights to difficult situations through the learning process, but the training data have to be prepared and selected in advance.

Table 4.2 Comparison of performances in DP-, OC- and CB-group (%)

                        DP                     OC                     CB
                TPR    FPR    AR       TPR    FPR    AR       TPR    FPR    AR
Set I-Voting    89.21  0.90   94.22    81.04  4.12   88.55    85.43  8.77   90.12
Set II-Voting   92.81  0.60   96.15    89.73  3.09   93.36    89.61  7.41   92.02
Set I-NN        91.06  0.80   95.18    84.73  2.83   91.02    87.60  4.79   93.75
Set II-NN       94.86  0.40   97.26    92.05  2.06   95.03    92.25  3.32   95.83

DP = Different Poses. OC = Occlusion. CB = Complex Background.

Table 4.3 Performances and average executing time

                TPR      FPR     AR       Executing Time
Set I-Voting    84.11%   5.78%   90.34%   0.089 s
Set II-Voting   90.56%   4.72%   93.74%   0.122 s
Set I-NN        87.01%   3.41%   92.91%   0.092 s
Set II-NN       92.86%   2.37%   95.80%   0.131 s

Chapter 5

Conclusions and Future Works

This thesis proposes an intelligent human detection system based on depth information generated by Kinect, to find humans in a sequence of images and resolve occlusion problems. The system is divided into three parts: ROI selection, feature extraction and human recognition. First, histogram projection and connected component labeling (CCL) are applied to select the ROIs, exploiting the property that humans generally appear vertically. Through histogram projection, the system generates a rough vertical distribution in 3-D space; if the height of an object exceeds a certain threshold, the object is selected as an ROI and marked by CCL. Then, each ROI is normalized based on its distance to the camera, and the human shape feature is extracted by edge detection and distance transformation to obtain the distance image. Finally, chamfer matching is used to search for possible parts of the human body under the component-based concept, and shape recognition is then implemented by a neural network according to the combination of body parts. From the experimental results, the following conclusions can be drawn:

 The proposed system can detect humans with an accuracy rate higher than 90% and an average executing time of about 0.1 s/frame. Besides, with the help of the depth image and the component-based concept, the system can detect humans correctly even under serious occlusion.

 The use of depth images for human detection has some distinct advantages over conventional techniques. First, it is robust to illumination change and the influence of distance. Second, it can deal with occlusion problems efficiently. Third, it is suitable for a moving camera because no background modeling is required.

 The use of chamfer matching to obtain significant human features greatly reduces the dimension and size of the neural network. Conventional pattern recognition often feeds a patch of image, or all the pixels of an ROI, directly into the neural network; consequently, the neural network requires hundreds or thousands of neurons in its input layer and a huge amount of training data. With pre-processing via chamfer matching, the number of neurons in the input layer can be reduced to fewer than 50.

In order to improve human-robot interaction, three functions are often required for a robotic system: human detection, human tracking and pose detection. With these three functions, the robot can detect humans in the image, track specific humans and interact with them based on their poses; therefore, the interaction between human and robot can be more accurate and natural. In this thesis, the proposed system has been demonstrated to be successful in human detection. In the future, all the schemes developed in this thesis will be further applied

