


Chapter 4 Enhancement of line-based people localization

4.3 Early screening for line correspondence

4.4.1 Applying the improvements described in Sections 4.1 and 4.2

In this subsection, the improvements described in Sections 4.1 and 4.2 are evaluated with several videos taken from both indoor and outdoor scenes, with different degrees of occlusion.

Comparisons with [25] and [26] are also included to show that the proposed method achieves comparable correctness/accuracy in localization but with much higher computation speed.

Additionally, we investigate the performance of the proposed method with different numbers of cameras and densities of line samples in an image.

4.4.1.1 Experiments for different degrees of occlusion with indoor/outdoor sequences

The performance evaluation is carried out under Windows 7 with 4 GB RAM and a 2.4 GHz Intel Core 2 Duo CPU, without using any additional hardware. Table 4.1 summarizes detailed localization results of the proposed 3D line reconstruction method as well as three other methods.

In addition to our previous work [26], a modified version14 of the approach proposed in [25] is also implemented and tested. The proposed approach achieves the highest recall rates for S1-S3, while the other three methods achieve higher precision rates for the three video sequences.

Similarly, only very small differences (within 0.65 cm) in the accuracy of the derived people locations can be found among these methods, except for the method proposed in Chapter 3. One can also see that the 3D line reconstruction method achieves higher recall and precision rates (+3%) than the method described in Chapter 3 for S3. Overall, the mean and standard deviation of the (x-y) location errors of the proposed method for S1-S3 together are 10.70 cm and 5.90 cm, respectively, which can hopefully be regarded as sufficient for many surveillance applications15.

As for the computational speed, in frames per second (FPS), the values for different cases listed in Table 4.1 are evaluated without including the cost of foreground segmentation. One can

14 In our implementation, which also does not perform people tracking, binary foreground images are adopted as system input, as in the other two algorithms. A grid size of 100 × 100, with 10 cm grid spacing, is chosen for each of the twenty reference planes. A grid point on the ground is regarded as occupied if more than Tacc = 11 grid points with the same horizontal coordinates (but on reference planes of different heights) correspond to image foreground in all (4) views. Connected component analysis is then applied to identify connected occupancy regions, and the connected occupancy regions with very small areas, i.e., smaller than 22% of the average area of such regions, are regarded as noise and removed.
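For illustration, the grid-based occupancy test and noise removal described in this footnote can be sketched as follows. This is a minimal sketch, assuming a binary volume fg_volume of shape (num_planes, 100, 100) has already been computed by projecting each grid point into all views; the function and variable names are illustrative, not from the thesis implementation.

```python
import numpy as np
from collections import deque

def localize_occupancy(fg_volume, t_acc=11, area_ratio=0.22):
    """Grid-based localization sketch: fg_volume[k, i, j] is True if the
    grid point (i, j) on reference plane k projects into the image
    foreground in all views."""
    # A ground grid point is occupied if more than t_acc stacked grid
    # points (same horizontal coordinates, different heights) are foreground.
    occupied = fg_volume.sum(axis=0) > t_acc

    # 4-connected component labelling via breadth-first search.
    labels = np.zeros(occupied.shape, dtype=int)
    regions, next_label = [], 1
    for i, j in zip(*np.nonzero(occupied)):
        if labels[i, j]:
            continue
        queue, region = deque([(i, j)]), []
        labels[i, j] = next_label
        while queue:
            y, x = queue.popleft()
            region.append((y, x))
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < occupied.shape[0] and 0 <= nx < occupied.shape[1]
                        and occupied[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
        regions.append(region)
        next_label += 1

    if not regions:
        return []
    # Regions much smaller than the average region area are treated as noise.
    mean_area = sum(len(r) for r in regions) / len(regions)
    return [r for r in regions if len(r) >= area_ratio * mean_area]
```

Each returned region is a list of occupied ground grid points corresponding to one detected person.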

15 The errors are only calculated for correctly detected people locations, i.e., those with location errors less than 30 cm, which contribute to the precision rates listed in Table 4.1.


Table 4.1. (Columns: Method, Recall, Precision, Mean error (cm), Frames per second.)

see that a speed-up of more than an order of magnitude over the method in [25] can be achieved by the proposed approach, with as much as 70 times acceleration (nearly 2.7 times improvement over our previous approach in [26]) in the processing speed of S1. While real-time performance can be achieved for S1 and S2, the computation speed drops to a near real-time 21.61 FPS when the number of people increases to twelve16. Note that for [25] the computation times (in FPS) are about the same for the different cases. This is because the time complexity of generating the synergy maps depends mainly on the size of each image frame and the total number of views. In addition, the computation speed of the 3D line reconstruction method is quite similar to that of the method described in Chapter 3. However, the 3D line reconstruction method achieves higher recall and precision rates when the scene is more crowded, i.e., for S2 and S3.

Although the above evaluations show that the proposed method can often provide reasonably good localization results, there are extreme cases of poor foreground segmentation which cannot be well handled by the proposed method. Figs. 4.6(a)-(h) show localization results and foreground regions for the 51st frame of S1. In Figs. 4.6(a) and (e), one can see that the foreground segmentation of a person (in the red circle) is very poor because of reflections as well as cluttered

16 This is because the computational time is dominated by the number of 2D line samples, which grows with the area of the foreground regions.


Fig. 4.6. A failure example of the proposed method. (a)-(d) The localization results (illustrated with bounding boxes) of four views. (e)-(h) Corresponding foreground regions and 2D line samples. (i) 3D line samples representing different persons in the scene.


Fig. 4.7. An example of missed detections and false alarms in S3. (a) Segmented foreground regions and 2D line samples. (b) 3D line samples representing different persons in the scene. (c) The localization results illustrated with bounding boxes. Note that corresponding colors are used in (b) and (c) for different groups/bounding boxes after grouping.

Fig. 4.8. Localization results for scenario S4.

Fig. 4.9. Localization results for scenario S5.


background (see green arrows). Consequently, fewer 3D line samples are retained after the screening process, as shown in Fig. 4.6(i), resulting in a failure. Since some 3D line samples can still be reconstructed correctly for that person at different time instances, erroneous results are generated for only 3 out of 20 frames (from frame 41 to frame 60), compared with 13 erroneous frames obtained by the method in [25].

On the other hand, problematic results may also be generated due to very serious occlusions.

Firstly, as shown in Fig. 4.7, there may be a ground region that is covered by foreground in all views. Whether or not a person actually exists there, a 3D MA will be generated. If such a 3D MA cannot be filtered out by the aforementioned geometric rules, a false alarm will occur (see the yellow arrow in Fig. 4.7(c))17. Secondly, when the distances between people are too small (see the red arrow in Fig. 4.7(c)), their 3D line samples will be grouped into the same group (see Fig. 4.7(b)), resulting in two missed detections and one false alarm. This is because, for localization efficiency, the grouping scheme only checks whether the distance between two line samples is smaller than a threshold when grouping 3D line samples18. (A more detailed discussion of the effect of the distance threshold can be found in Appendix B.)
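The distance-threshold grouping mentioned above can be sketched with a union-find structure, as follows. This is only an illustrative sketch: the distance function between two 3D line samples and the threshold value are placeholders, not the exact definitions used in the thesis.

```python
def group_line_samples(samples, dist, threshold):
    """Group samples transitively: two samples fall into the same group
    whenever their pairwise distance is below the threshold."""
    parent = list(range(len(samples)))

    def find(i):
        # Find the group representative, with path halving for efficiency.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union any pair of samples closer than the threshold.
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if dist(samples[i], samples[j]) < threshold:
                parent[find(i)] = find(j)

    # Collect sample indices by group representative.
    groups = {}
    for i in range(len(samples)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Because grouping is transitive, two people standing very close together can end up in a single group, which is exactly the failure mode illustrated by the red arrow in Fig. 4.7(c).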

To further evaluate our method in outdoor environments, S4 and S5 are captured from a real scenario with an image resolution of 360 × 240. In general, working in such an environment can be challenging for visual surveillance systems, since there are more time-varying factors such as object illumination, wind speed, and shadows of varying strength. For the real scene under consideration, groups of people of different sizes walk quickly through the monitored area19 (green polygons in Figs. 4.8 and 4.9). Thus, fewer image frames are captured for S4 and S5 than for S1-S3. Figs. 4.8 and 4.9 show snapshots of localization results for S4 and S5, respectively, with more statistics summarized in Table 4.2. One can see that a correctness/accuracy level similar to that shown in Table 4.1 can be achieved with the proposed approach, except for larger differences between (i) the recall and precision rates for S4, and (ii) the mean localization errors for S4 and S5. Such differences may result from a higher probability of the aforementioned occlusions for people walking together along a passage, and/or from complexities associated with an outdoor scene.

In practice, due to significant differences between the indoor and outdoor scenes where video sequences S1-S3 and S4-S5 are captured, respectively, different parameter values may need to be selected to achieve desirable localization results. In the next Subsection (4.4.1.2), effects of choosing different densities of 2D line samples in each image, as well as incorporating different

17 Such a problem may be eliminated by adopting additional temporal information.

18 To partially resolve this problem, a heuristic scheme is applied in our method: if a group contains a large number of 3D line samples, it is divided into two groups. Specifically, we calculate the average number of 3D line samples, NC, over all groups, and divide a group into two if it contains more than 2NC line samples.
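The heuristic in this footnote can be sketched as follows. Since the footnote does not specify how an oversized group is divided, the split rule used here (a median cut along the axis of larger spread) is an assumption for illustration only; each sample is simplified to a ground-plane point (x, y).

```python
def split_oversized_groups(groups):
    """Divide any group containing more than 2*Nc samples into two groups,
    where Nc is the average group size.  The median-cut split rule is an
    illustrative assumption, not the thesis's exact method."""
    if not groups:
        return groups
    n_c = sum(len(g) for g in groups) / len(groups)  # average group size Nc
    result = []
    for g in groups:
        if len(g) > 2 * n_c:
            # Split along the coordinate axis with the larger spread.
            spread_x = max(p[0] for p in g) - min(p[0] for p in g)
            spread_y = max(p[1] for p in g) - min(p[1] for p in g)
            axis = 0 if spread_x >= spread_y else 1
            g_sorted = sorted(g, key=lambda p: p[axis])
            mid = len(g_sorted) // 2
            result.extend([g_sorted[:mid], g_sorted[mid:]])
        else:
            result.append(g)
    return result
```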

19 It is assumed that the evaluation of people localization is only performed for the monitored area.


Table 4.2. Localization results of sequences S4 and S5.

(Columns: Sequence, Number of frames/persons, Method, Recall, Precision, Mean error (cm), Frames per second.)

Table 4.3. Results of using different numbers of cameras.

Number of cameras          3        4        5
Recall                     95.4%    96.2%    98.3%
Precision                  85.7%    95.0%    96.6%
Localization error (cm)    11.30    10.70    10.13
Frames per second          75.57    29.48    24.16

Fig. 4.10. Results of using different line densities (pixel-spacings, see text) with four cameras. (a) Recall and precision. (b) Localization error. (c) Computation speed.

numbers of cameras in the proposed localization system, will be investigated (only for the indoor scene, for brevity). While these two parameters determine the initial amount of data to be processed by the proposed algorithm, other parameters are used to tune the algorithm for


better performance under different environmental conditions, as will be discussed in Appendix B.

4.4.1.2 Experiments for different numbers of cameras and densities of sampling

To investigate the relationship between localization performance and the number of cameras, the indoor scenarios S1-S3 are examined with an additional view captured from a different camera, and the results are presented in Table 4.3. One can see that while similar recall rates can be obtained with different numbers of cameras, the precision rate with three cameras is much lower than with four or five cameras. This implies that using only three cameras may not be sufficient when there are serious occlusions. In addition to the above performance indices, adding more cameras also improves the localization accuracy. However, if slight degradations in these performance indices are acceptable, a set of four cameras may be used when hardware (camera) cost is of major concern.

In order to investigate the influence of the density of line samples in an image on the localization performance, a very simple sampling scheme is adopted in our method. In particular, the line samples originate from the vanishing point and end at equally spaced image pixels on the bottom row of the captured image. Fig. 4.10(a) shows the decrease of both the recall and precision rates with such pixel-spacing20. One can see that for spacings of less than ten pixels, similar recall and precision rates can be obtained, while a larger spacing seems to capture inadequate information for localization. Fig. 4.10(b) shows that the localization errors grow slightly with the pixel-spacing. Whether the localization errors due to different pixel-spacings are acceptable will depend on the application under consideration. Finally, Fig. 4.10(c) shows the growth of computation speed with pixel-spacing; again, the choice among different pixel-spacings will depend on the application.

In addition to the occlusions among people discussed earlier (cf. Fig. 4.7), additional interferences from non-human foreground objects (vehicles) include: (i)

20 While a spacing of 5 pixels is selected for S1-S3, a spacing of 4.4 pixels is selected for S4-S5.

21 Similar to the experiments conducted on S4 and S5, the evaluation of people localization is only performed for the monitored area, and the image resolution is 360 × 240.


vehicle-people occlusion and (ii) the presence of vehicles in the monitored area. While (i) can be seen in all four views but does not cause a problem in this case, (ii) does cause a false alarm (shown as a large (dark purple) group in Fig. 4.11(i)). Overall, the recall and precision rates for this challenging scene are evaluated as 80.9% and 80.2%, respectively, over a total of 108 image frames.
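The simple line-sampling scheme described above, in which each 2D line sample runs from the vanishing point to an equally spaced pixel on the bottom image row, can be sketched as follows. The vanishing-point coordinates and the endpoint representation are illustrative assumptions.

```python
def generate_line_samples(vanishing_point, image_width, image_height, spacing=5.0):
    """Generate 2D line samples as segments from the (vertical) vanishing
    point to equally spaced pixels on the bottom row of the image.
    Returns a list of ((x0, y0), (x1, y1)) endpoint pairs."""
    vx, vy = vanishing_point
    samples = []
    x = 0.0
    while x < image_width:
        # One line sample per bottom-row pixel position, `spacing` apart.
        samples.append(((vx, vy), (x, float(image_height - 1))))
        x += spacing
    return samples
```

With the 360 × 240 outdoor sequences and a spacing of 5 pixels, this scheme yields 72 line samples per view; a larger spacing reduces computation but, as Fig. 4.10(a) shows, eventually captures inadequate information for localization.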

4.4.1.4 Summary

Instead of using all foreground pixels, line samples from multiple views are used to efficiently find possible 3D line samples of the human body. While our earlier approach in [26] is a direct extension of the approach in [25], in that projections of pixels (lines in [26]) are computed for horizontal planes first, the algorithm presented in Section 4.1 reconstructs the above samples in 3D space directly. Additional efficiency of the proposed approach arises from effective screening of these 1D samples using new geometric constraints of the human body. Such efficiency is crucial for surveillance applications which demand prompt attention (and high processing speed), with people localization being part of the complete process22. Experimental results demonstrate that the proposed method can handle serious occlusions in quite crowded scenes and provide localization results with correctness and accuracy comparable to those attained with a modified version of [25], but with much higher processing speed.

Additionally, because the proposed localization approach is based on 3D reconstruction/sampling, it is possible to extend the approach to track people in 3D space23.

