Experimental Results - Accelerating a People Localization Algorithm Based on Vanishing-Point Line Sampling Using Footprint Analysis

Our experiments are conducted on three QVGA-resolution (360×240) video sequences captured at 30 frames per second; each sequence has four camera views of an indoor scene under different degrees of occlusion. Calibration poles are placed vertically on the ground of the scene beforehand for the estimation of the vanishing points and the multiple homography matrices. The sequences are captured with different numbers and trajectories of people. Table 5.1 gives detailed information on the three test sequences, named S1, S2, and S3. The average distance between the cameras and the monitored area is about 15 m. The computation is performed on a PC running Windows 7 with 4 GB of DDR3 RAM and a 2.4 GHz Intel i5 M520 CPU, without any additional hardware.

Table 5.1 Information on the three video sequences

Sequence  Number of frames  Number of persons
S1        691               9

For S1, the lighting conditions are quite complicated: sunlight may come directly through the windows, and the reflections from the floor can be seen clearly. A total of 691 frames are captured for S1, in which eight persons walk periodically around a ninth person standing near the center of the monitored area. Figures 5.1(b) and (c) show the verified 3D line samples and the major axes (MAs) that represent the people localization results, viewed from a slightly higher elevation angle so that they can easily be matched to camera view 1 (the leftmost one in Figure 5.1(a)). In addition, for a closer examination of the correctness of the proposed people localization and height estimation scheme, bounding boxes with a fixed cross-section of 50 cm × 50 cm and with the estimated heights are back-projected onto the captured images, as shown in Figure 5.1(d). One can see that these bounding boxes overlay the corresponding individuals well. The recall and precision rates for the whole sequence are 95.8% and 95.7%, respectively.

Figure 5.1 An instance of sequence S1, frame 1 (9 persons, eight circling the center one). (a) Input frame from four different viewing directions. (b) Verified 3D line samples of different clusters in the scene. (c) 3D major axes (MAs) to represent different persons in the scene. (d) Localization results illustrated with bounding boxes.
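The back-projection of bounding boxes described above can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: it assumes each camera is described by a 3×4 projection matrix, and the names `box_corners`, `project`, `image_bounding_box` and the toy matrix `P_TOY` are hypothetical. Enclosing the projected corners in an axis-aligned rectangle is also an assumption; the figures may draw the box differently.

```python
def box_corners(x_cm, y_cm, height_cm, side_cm=50.0):
    """Eight corners of an upright box standing on the ground plane Z = 0."""
    half = side_cm / 2.0
    return [
        (x_cm + dx, y_cm + dy, z)
        for z in (0.0, height_cm)
        for dx in (-half, half)
        for dy in (-half, half)
    ]


def project(P, point):
    """Project a 3D point (cm) to pixel coordinates with a 3x4 matrix P."""
    X, Y, Z = point
    u, v, w = (r[0] * X + r[1] * Y + r[2] * Z + r[3] for r in P)
    return u / w, v / w


def image_bounding_box(P, x_cm, y_cm, height_cm):
    """Axis-aligned image rectangle enclosing the projected 3D box."""
    pts = [project(P, c) for c in box_corners(x_cm, y_cm, height_cm)]
    us = [p[0] for p in pts]
    vs = [p[1] for p in pts]
    return min(us), min(vs), max(us), max(vs)


# Hypothetical camera (not from the thesis): focal length 300 px,
# principal point (180, 120), 500 cm behind the world origin, looking along +Y.
P_TOY = [
    [300.0, 180.0, 0.0, 90000.0],
    [0.0, 120.0, -300.0, 60000.0],
    [0.0, 1.0, 0.0, 500.0],
]

# A person localized at the origin with an estimated height of 170 cm.
bbox = image_bounding_box(P_TOY, 0.0, 0.0, 170.0)
print(bbox)
```

With real data, one such matrix per camera view would be needed, and the rectangle would be drawn on the corresponding input frame.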

Figure 5.2(a) shows an instance of sequence S2, which has the same people count as S1, but the nine people walk randomly in the scene. While S2 may have more serious occlusions in some time intervals, the repeated occlusions caused by the periodic walking pattern in S1 do not occur in S2. As a result, both the recall and precision rates increase slightly, to 96.2%. To further examine the robustness of our method under serious occlusion, sequence S3 is evaluated; it is similar to S2 but has twelve persons walking randomly in the scene. While satisfactory localization results are obtained, as shown in Figure 5.3, the recall and precision rates for S3 decrease to 92.9% and 91.2%, respectively. As summarized in Table 5.2, the proposed approach works robustly on all three sequences, with some degradation in localization accuracy under serious occlusion.

Figure 5.2 An instance of sequence S2, frame 1 (9 persons, walking randomly).

Figure 5.3 An instance of sequence S3, frame 1 (12 persons, walking randomly).

Table 5.2 Performance of the proposed approach

Sequence  Recall  Precision  Average error (cm)  FPS
S1        95.8%   95.7%      11.86               236.36
S2        96.2%   96.2%      10.57               231.62
S3        92.9%   91.2%      11.25               181.96

In our experiments, “Recall” and “Precision” are defined by

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{16}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{17}$$

where TP is the number of correct detections (true positives), FN is the number of missed detections (false negatives), and FP is the number of false alarms (false positives). An estimated location at a distance of less than 30 cm from the ground truth is regarded as correct, and the “Average error” gives the average distance between the estimated people locations and the manually produced ground truth. The precision and recall rates in all three videos are above 90%. The computational speed, in frames per second (FPS), is evaluated without including the cost of background subtraction. The proposed approach achieves very high computational efficiency, even for the crowded scene S3, in which 12 persons can be located quite accurately at a processing speed of about 180 fps. The FPS is lower for S3 because the computation time is dominated by the number of 2D line samples, which grows with the area of the foreground regions. As for accuracy, the average error is less than 12 cm in all three sequences, which should be sufficient for many surveillance applications.
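As a minimal sketch of how these measures could be computed for one frame, the example below matches estimated and ground-truth ground positions within the 30 cm radius and derives recall, precision and average error. The greedy closest-first matching, the averaging over matched pairs only, and the names `evaluate_frame` and `summarize` are assumptions made for illustration; the text does not specify these details.

```python
import math


def evaluate_frame(estimated, ground_truth, max_dist_cm=30.0):
    """Greedy one-to-one matching of estimated to ground-truth positions (cm).

    Returns (true_positives, false_positives, false_negatives, distances).
    """
    # All candidate pairs inside the acceptance radius, closest first.
    pairs = sorted(
        (math.dist(e, g), i, j)
        for i, e in enumerate(estimated)
        for j, g in enumerate(ground_truth)
        if math.dist(e, g) < max_dist_cm
    )
    used_est, used_gt, distances = set(), set(), []
    for d, i, j in pairs:
        if i not in used_est and j not in used_gt:
            used_est.add(i)
            used_gt.add(j)
            distances.append(d)
    tp = len(distances)
    return tp, len(estimated) - tp, len(ground_truth) - tp, distances


def summarize(tp, fp, fn, distances):
    """Recall, precision and average error (cm) from accumulated counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    avg_error = sum(distances) / len(distances) if distances else 0.0
    return recall, precision, avg_error


# Toy frame (made-up coordinates): two good estimates and one far-off estimate.
estimated = [(100.0, 100.0), (300.0, 210.0), (500.0, 500.0)]
ground_truth = [(110.0, 100.0), (300.0, 200.0), (700.0, 700.0)]

tp, fp, fn, dists = evaluate_frame(estimated, ground_truth)
recall, precision, avg_error = summarize(tp, fp, fn, dists)
print(tp, fp, fn, recall, precision, avg_error)
```

For a whole sequence, the counts and distances would be accumulated over all frames before calling `summarize`.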

Although the above evaluations show that the proposed method can often provide reasonably good localization results, there are extreme cases that it cannot handle completely. First, when a foreground region is broken at leg level, the initial 3D line sample will not be generated and a missed detection will occur. In the example shown in Figure 5.4, the person in the blue shirt cannot be detected because the foreground region is broken at his leg level, so his position is not taken as a candidate people location (CPL). Second, when the scene is under very serious occlusion, e.g., in Figure 5.5, a ground region may be covered by foreground regions in all views (see the circled regions in Figure 5.5(a) and (b)). A 3D line sample will then be generated whether or not a person is actually present. If such a 3D line sample cannot be filtered out by the aforementioned verification procedure, a false alarm will occur (see the 3D MA in red in Figure 5.5(c)). Finally, when the distances between people are too small, their 3D MAs will be clustered into the same group, which leads to two missed detections and one false alarm, as shown in Figure 5.6. This happens because, for localization efficiency, the BFS scheme for clustering only checks whether the distance between two MAs is smaller than a threshold.
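The merging failure in the last case can be illustrated with a small sketch of threshold-based BFS clustering. It is an assumption-laden illustration rather than the thesis code: the samples are reduced to hypothetical 2D ground positions, the function name `cluster_bfs` is invented here, and the 30 cm and 15 cm thresholds are arbitrary values chosen for the example.

```python
import math
from collections import deque


def cluster_bfs(points, threshold_cm):
    """Group ground positions (cm) into connected components.

    Two points are linked when their distance is below threshold_cm;
    a breadth-first search collects everything reachable from a seed.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, members = deque([seed]), [seed]
        while queue:
            current = queue.popleft()
            near = [
                k for k in unvisited
                if math.dist(points[current], points[k]) < threshold_cm
            ]
            for k in near:
                unvisited.remove(k)
                queue.append(k)
                members.append(k)
        clusters.append(sorted(members))
    return sorted(clusters)


# Made-up positions: indices 0-1 belong to one person, 2-3 to a second,
# and 4 to a third person standing very close to the second.
points = [(0.0, 0.0), (10.0, 0.0), (200.0, 0.0), (210.0, 5.0), (235.0, 0.0)]

print(cluster_bfs(points, 30.0))  # the two nearby persons fall into one cluster
print(cluster_bfs(points, 15.0))  # a tighter threshold keeps them apart
```

Because the test is purely a pairwise distance check, a chain of nearby samples is enough to join two people into a single cluster, which is the trade-off made for speed.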

Figure 5.4 An example of a missed detection in sequence S2. (a) Segmented foreground regions and 2D line samples. (b) The localization results, in which the person in the blue shirt cannot be detected because of the broken foreground region at his leg level.

Figure 5.5 An example of false alarms in sequence S2. (a) Segmented foreground regions and 2D line samples in all views. (b) The localization results illustrated with bounding boxes in all views. (c) The 3D MAs to represent different persons in the scene. The 3D MA in red represents a false alarm.

Figure 5.6 An example of missed detections and a false alarm in sequence S2. (a) The localization results illustrated with bounding boxes. (b) Clusters of verified 3D line samples in the scene, with the circled region indicating the merging of two clusters.

For performance comparison with previous research, the people localization results obtained in [9] for the same sequences are listed in Table 5.3 (to be compared with Table 5.2). One can see that the approach proposed in this thesis achieves precision and recall rates similar to those in [9]. However, the processing speed is significantly enhanced, roughly nine to ten times faster than [9]. The gain comes from using 2D foreground line samples projected on the ground, instead of reconstructing 3D major axes by computing pairwise intersections of sample lines of the image foreground projected at different heights.

Table 5.3 Performance of people localization of [9].

Sequence  Recall  Precision  Average error (cm)  FPS
S1        93.7%   95.1%      11.07               26.69
S2        94.3%   94.1%      9.56                26.33
S3        92.3%   91.9%      9.57                18.09
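The speed-up over [9] can be checked directly from the FPS columns of Tables 5.2 and 5.3. The short calculation below uses only those reported values; the variable names are chosen here for illustration.

```python
# FPS values copied from Table 5.2 (proposed) and Table 5.3 (method of [9]).
fps_proposed = {"S1": 236.36, "S2": 231.62, "S3": 181.96}
fps_baseline = {"S1": 26.69, "S2": 26.33, "S3": 18.09}

# Speed-up factor per sequence: proposed FPS divided by baseline FPS.
speedup = {name: fps_proposed[name] / fps_baseline[name] for name in fps_proposed}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x faster")
```

The ratios are close to 9 for S1 and S2 and close to 10 for S3.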

The results of person height estimation for S1 are shown in Figure 5.7. The red squares indicate the actual heights, and the blue dots represent the estimated heights together with intervals of one standard deviation. Figure 5.8 shows that similar results are obtained for S2. However, when the occlusion becomes more serious, the performance of height estimation is degraded, as shown in Figure 5.9 for S3. More detailed height estimation data for S1, S2, and S3 can be found in Tables 5.4, 5.5, and 5.6, respectively.

Figure 5.7 Results of person height estimation for S1.

Figure 5.8 Results of person height estimation for S2.

Figure 5.9 Results of person height estimation for S3.

Table 5.4 Results of person height estimation for S1 (all values in cm).

Person   P1     P2     P3     P4     P5     P6     P7     P8     P9
Actual   167    166    173    170    174    171    174    180    172
Average  168.5  167.1  174.2  170.0  170.0  173.8  173.0  178.3  171.7
Std      5.3    5.1    6.6    5.6    5.7    5.3    6.5    4.9    7.6
Error    1.5    1.1    1.2    0.0    4.0    2.8    1.0    1.7    0.3

Table 5.5 Results of person height estimation for S2 (all values in cm).

Person   P1     P2     P3     P4     P5     P6     P7     P8     P9
Actual   170    171    180    166    174    172    167    174    173
Average  172.4  176.2  174.9  165.7  169.9  174.1  171.3  173.5  173.8
Std      4.3    5.2    5.1    5.7    6.6    6.6    6.0    5.8    3.7
Error    2.4    5.2    5.1    0.3    4.1    2.1    4.3    0.5    0.8

Table 5.6 Results of person height estimation for S3 (all values in cm).

Person   P1     P2     P3     P4     P5     P6     P7     P8     P9     P10    P11    P12
Actual   174    180    191    166    174    173    173    170    172    167    168    171
Average  168.9  171.4  186.8  164.3  178.7  183.9  171.9  167.9  173.8  170.4  170.8  176.1
Std      3.6    6.6    9.0    7.3    9.1    9.5    6.4    2.6    5.4    3.4    7.2    5.3
Error    5.1    8.6    4.2    1.7    4.7    10.9   1.1    2.1    1.8    3.4    2.8    5.1
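The "Error" rows in Tables 5.4 to 5.6 are consistent with the absolute difference between the average estimated height and the actual height. The sketch below recomputes them from the "Actual" and "Average" rows and also derives a mean error per sequence; that per-sequence mean is not reported in the tables and is added here only to illustrate the degradation for S3.

```python
# "Actual" and "Average" rows copied from Tables 5.4, 5.5 and 5.6 (cm).
heights = {
    "S1": ([167, 166, 173, 170, 174, 171, 174, 180, 172],
           [168.5, 167.1, 174.2, 170.0, 170.0, 173.8, 173.0, 178.3, 171.7]),
    "S2": ([170, 171, 180, 166, 174, 172, 167, 174, 173],
           [172.4, 176.2, 174.9, 165.7, 169.9, 174.1, 171.3, 173.5, 173.8]),
    "S3": ([174, 180, 191, 166, 174, 173, 173, 170, 172, 167, 168, 171],
           [168.9, 171.4, 186.8, 164.3, 178.7, 183.9, 171.9, 167.9, 173.8,
            170.4, 170.8, 176.1]),
}


def abs_errors(actual, average):
    """Per-person absolute height error, rounded to 0.1 cm as in the tables."""
    return [round(abs(m - a), 1) for a, m in zip(actual, average)]


errors = {name: abs_errors(act, avg) for name, (act, avg) in heights.items()}
mean_error = {name: sum(e) / len(e) for name, e in errors.items()}

for name in heights:
    print(name, errors[name], f"mean = {mean_error[name]:.2f} cm")
```

The mean error grows from S1 to S3, in line with the degradation seen in Figure 5.9.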
