
Chapter 3

Acceleration of vanishing point-based line sampling scheme for people localization

3.2 People localization and height estimation

In this section, to achieve the goal of people localization and height estimation, vertical line samples of the human body are generated for the above CPBs. These line samples are then refined with respect to the image foreground from different views, screened by some physical properties of the human body, and grouped into axes of individual persons. In particular, four equally-spaced rods of 200 cm in height are established on each CPB, as shown in Fig. 3.3. For each rod, we back-project it onto each camera view, and inwardly refine its top and bottom (C and D in Fig. 3.3, as well as C′ and D′ calculated using the view-invariant cross ratio) until they are covered by a foreground region. For error tolerance, e.g., to cope with noise and occlusion, the intersection of all the refined 3D rods for each ground location from different camera views is adopted as the final line sample of a possible human body.
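As an illustration of this refine-and-intersect idea (not the exact implementation), the following Python sketch uses a single vertical rod per candidate ground location instead of the four rods of Fig. 3.3, and assumes hypothetical inputs: a binary foreground mask per view and a project() callable that maps a 3D point (in cm) to pixel coordinates using the calibration of Appendix A.

```python
import numpy as np

def refine_rod_in_view(ground_xy, fg_mask, project, rod_height=200.0, step=5.0):
    """Shrink a vertical rod standing at ground_xy from both ends until its
    projections land on foreground pixels in this view; return the covered
    (bottom, top) height interval in cm, or None if nothing is covered."""
    heights = np.arange(0.0, rod_height + step, step)
    covered = []
    for h in heights:
        u, v = project(np.array([ground_xy[0], ground_xy[1], h]))
        u, v = int(round(u)), int(round(v))
        inside = 0 <= v < fg_mask.shape[0] and 0 <= u < fg_mask.shape[1]
        covered.append(bool(inside and fg_mask[v, u] > 0))
    idx = np.flatnonzero(covered)
    if idx.size == 0:
        return None
    return float(heights[idx[0]]), float(heights[idx[-1]])

def line_sample_from_views(ground_xy, fg_masks, projectors):
    """Intersect the refined intervals from all views into one final line sample."""
    bottom, top = 0.0, float("inf")
    for fg_mask, project in zip(fg_masks, projectors):
        interval = refine_rod_in_view(ground_xy, fg_mask, project)
        if interval is None:
            return None
        bottom, top = max(bottom, interval[0]), min(top, interval[1])
    return (bottom, top) if top > bottom else None
```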

Based on the physical shape/size of a human body, we then apply the rules described in Subsection 2.2.1 to filter out incorrect 3D line samples obtained above. The grouping procedure described in Subsection 2.2.2 is also applied. Finally, for each group, the average location (maximum height) of the line samples is regarded as a person's location (height).
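Since the grouping rules of Subsection 2.2.2 are not reproduced here, the sketch below substitutes a simple greedy, distance-based grouping on the ground plane; the 40 cm threshold and the input format are illustrative assumptions only.

```python
import numpy as np

def group_and_localize(samples, ground_dist_thresh=40.0):
    """samples: list of (x, y, height) tuples in cm for the accepted line samples.
    Each group yields (mean x, mean y, max height) as a person's location/height."""
    groups = []
    for x, y, h in samples:
        for g in groups:
            gx = np.mean([p[0] for p in g])
            gy = np.mean([p[1] for p in g])
            if np.hypot(x - gx, y - gy) < ground_dist_thresh:
                g.append((x, y, h))
                break
        else:
            groups.append([(x, y, h)])
    return [(float(np.mean([p[0] for p in g])),
             float(np.mean([p[1] for p in g])),
             float(max(p[2] for p in g)))
            for g in groups]
```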

3.3 Experiments

To evaluate our methods under different degrees of occlusion, we captured several video sequences of indoor and outdoor scenes. For each scene, calibration pillars are placed vertically and then removed from the scene for the estimation of camera centers, vanishing points, and multiple homographic matrices (see Appendix A). These sequences are captured with different numbers and trajectories of people.


The computation is performed on a PC under Windows 7 with 4 GB RAM and a 2.4 GHz Intel Core2 Duo CPU, without using any additional hardware.

Fig. 3.4 shows an instance of scenario S1 captured from four different viewing directions at an image resolution of 360×240. The average distance between the cameras and the monitored area is about 15 m. One can see that the lighting conditions are quite complicated: sunlight may come through the windows directly, and reflections from the floor can be seen clearly. A total of 691 frames are captured for S1, wherein eight persons walk around a ninth one standing near the center of the monitored area.

Figs. 3.5(a) and (b) show the 2D line samples generated for Fig. 3.4(b) and the reconstructed 3D MAs, viewed from a slightly higher elevation angle, respectively. In addition, for a closer examination of the correctness of the proposed people localization and height estimation scheme, bounding boxes with a fixed cross-section, and with their heights obtained from the derived 3D MAs, are back-projected onto the captured images, as shown in Fig. 3.5(c) for the image in Fig. 3.4(b). One can see that these bounding boxes overlay nicely with the corresponding individuals. The recall and precision rates for the whole sequence are evaluated as 96.3% and 95.9%, respectively.

Fig. 3.6 shows similar localization results for scenario S2, which has the same number of people as S1, but the nine people walk randomly in the scene so that the occlusion among them becomes more serious. As a result, both the recall and precision rates decrease slightly.

To further examine the robustness of our method under serious occlusion, scenario S3 is evaluated, which is similar to S2 but has twelve persons randomly walking in the scene. Since the scene becomes more crowded and serious occlusion occurs more frequently, the foregrounds of different persons may easily merge into larger regions, as shown in Fig. 3.7(a). While satisfactory localization results are obtained in Figs. 3.7(b) and (c), the recall and precision rates for S3 decrease to 91.9% and 90.0%, respectively.

The performance of the people localization approach described in this chapter is presented in Table 3.1. The precision and recall rates in all three scenes are 90% or higher. Furthermore, the proposed approach achieves very high computational efficiency, even for the crowded scene S3, wherein 12 persons can be located quite accurately at a high processing speed of about 100 fps.

For performance comparison, similar results of people localization obtained in [26] are listed in Table 3.2. One can see that the approach proposed in this chapter achieves similar precision and recall rates as in [26]. However, the processing speed is enhanced (about 2.6 times faster than [26]) due to the use of 3D line samples, instead of reconstructing 3D major axes via computing pairwise intersections of sample lines of image foreground projected at different heights.


Fig. 3.4. An instance of scenario S1, captured from four different viewing directions.

Fig. 3.5. Localization results for scenario S1. (a) Segmented foreground regions and 2D line samples for Fig. 3.4(b). (b) 3D major axes representing different persons in the scene. (c) Localization results illustrated with bounding boxes.

Fig. 3.6. Localization results, similar to those shown in Fig. 3.5, for scenario S2.

Fig. 3.7. Localization results, similar to those shown in Fig. 3.5, for scenario S3.

Table 3.1. Performance of the proposed approach in this chapter.

Sequence Recall Precision Avg. error FPS

S1 96.3% 95.9% 12.16cm 30.74(0.47)

S2 95.2% 95.3% 10.94cm 32.06(0.52)

S3 91.9% 90.0% 11.32cm 23.78(0.41)


Table 3.2. Performance of people localization of [26].

Sequence Recall Precision Avg. error FPS

S1 92.0% 95.7% 11.60cm 11.62(1.008)

S2 94.9% 97.3% 10.00cm 12.05(1.201)

S3 93.3% 94.3% 10.28cm 8.34(1.025)

Fig. 3.8. Results of height estimation for S1.

Fig. 3.9. Results of height estimation for S2.

Fig. 3.10. Results of height estimation for S3.



The results of person height estimation for S1 are presented in Fig. 3.8, where red squares indicate the actual heights and blue dots represent the estimated heights, together with intervals of one standard deviation. One can see that the errors are less than 5 cm. Similar estimation results for S2 can be observed in Fig. 3.9. However, in Fig. 3.10, the height estimate for one person (P6) has an error of more than 10 cm, which may result from more serious occlusion.

3.4 Summary

We propose an efficient and effective approach for people localization using multiple cameras. Enhanced from [26], we retain the advantage of vanishing point-based line sampling, and develop a 3D line sampling scheme to estimate people locations, instead of reconstructing 3D major axes via computing pairwise intersections of the sample lines at different heights as in [26]. The computation cost is thus greatly reduced. In addition, an effective height estimation scheme is also proposed in this chapter. Experiments on crowded scenes with serious occlusions verify the effectiveness and efficiency of the proposed approaches.


Chapter 4

Enhancement of line-based people localization

In this chapter, enhancement of the efficiency of the people localization approach described in Chapter 2 (see also [26]) is considered. The three major improvements include (i) more efficient 3D reconstruction, (ii) more effective filtering of reconstructed 3D line samples, and (iii) the introduction of a view-invariant measure of line correspondence for early screening. While (i) and (ii) are direct improvements/enhancements of the approach presented in Chapter 2, (iii) introduces a new way of measuring the correspondence of two line samples obtained in different views.

4.1 Efficient 3D line construction from intersection of two triangles

While the approach described in Chapter 2 takes a lot of computation time to calculate intersection points on multiple reference planes, as shown in Fig. 4.1 (left), an equivalent reconstruction of the 3D axis can actually be obtained by intersecting the two triangles⁸, as shown in Fig. 4.1 (right). By adopting such a method, the computational time, which no longer depends on the number of intersection points (reference planes), is expected to decrease greatly. Axis points can then be estimated by direct sampling along the 3D axis if necessary.
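The underlying geometry can be sketched as follows: each triangle's supporting plane passes through the camera center and the two ground-plane projections of the 2D line sample's end points, and the 3D axis lies on the intersection of the two planes (clipping that line to the two triangles gives the finite segment). All coordinates below are made up for illustration and the helper names are not from the thesis.

```python
import numpy as np

def plane_from_points(p0, p1, p2):
    """Return (n, d) of the plane through three 3D points, satisfying n . x = d."""
    n = np.cross(p1 - p0, p2 - p0)
    return n, np.dot(n, p0)

def intersect_planes(n1, d1, n2, d2):
    """Return a point and unit direction of the line where two planes meet,
    or None if the planes are (nearly) parallel."""
    direction = np.cross(n1, n2)
    if np.linalg.norm(direction) < 1e-9:
        return None
    # Pick the unique point on the line whose component along `direction` is zero.
    A = np.vstack([n1, n2, direction])
    b = np.array([d1, d2, 0.0])
    point = np.linalg.solve(A, b)
    return point, direction / np.linalg.norm(direction)

# Example with made-up coordinates (cm): camera centers and the ground-plane
# projections of the two end points of each 2D line sample.
cam1, cam2 = np.array([0., -500., 300.]), np.array([600., 0., 280.])
g1a, g1b = np.array([100., 90., 0.]), np.array([105., 110., 0.])
g2a, g2b = np.array([98., 95., 0.]), np.array([103., 108., 0.])
axis_line = intersect_planes(*plane_from_points(cam1, g1a, g1b),
                             *plane_from_points(cam2, g2a, g2b))
```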

4.2 Refinement and verification of reconstructed 3D line samples

Although the rules of geometric filtering adopted in Chapter 2 are low-cost and effective, more filtering rules may be included to reduce misdetections. Since the two ends of a 3D line sample reconstructed above may be inaccurate, e.g., due to noise, we propose a refinement procedure to improve their precision. Additionally, two new rules are added, one before and the other after the refinement procedure, to increase the computation speed. Thus, the entire filtering procedure becomes more precise and effective. In particular, the following new rule, together with Rules 1-2, is applied to a line sample right after the 3D reconstruction:

3) The height of its top end point Pht is lower than Ttl. Fig. 4.2(a) shows the line samples which survive Rules 1-3.

The main objective of the above three rules is to preserve two kinds of 3D line samples, which correspond to (i) the full length of a standing/walking person or (ii) the head and torso of a person without his/her feet. By selecting appropriate thresholds, these three rules may also accommodate human activities such as jumping and squatting. In practice, these three rules can efficiently remove most of the inappropriate 3D line samples, e.g., 84% of the originally reconstructed 3D line samples for the above example. However, since each 3D line sample is reconstructed from observations of two views only, the top and bottom ends of each 3D line sample may not be very accurate in position. To deal with such a problem, a refinement procedure using information from additional views, as described next, is adopted to find more accurate positions of the two end points before further verification of the 3D line samples is performed.

8 The camera centers can be found in advance using at least two of the aforementioned four pillars.

Fig. 4.1. Illustrations of the simplified 3D reconstruction.

Conceptually, the refinement scheme is based on the fact that if a 3D line sample corresponds to a real person in the scene, its image in all views should be covered by foreground regions. In other words, its top and bottom end points will be covered by some foreground regions in all views. If that is not the case, the 3D line sample should be shortened until it falls within foreground regions in all views. Specifically, for each 3D line sample, we can use equally spaced sample points between its two ends Pht and Phb to form axis samples {Pht, …, Phb}⁹ (see (2.1) in Subsection 2.1.3). The refinement of the top end point corresponds to finding the first sample point below Pht that is covered by some foreground region in all views. Similarly, the refinement of the bottom end point can be done by searching in the upward direction from Phb.

After such a refinement (shrinking) procedure, Rules 1-3 can be applied again, together with another new rule,

4) The height of its top end point Pht is higher than Tth.

to filter out inappropriate 3D line samples. One can see from Fig. 4.2(b) that rough people locations can be distinguished visually from the remaining 3D line samples. Finally, a threshold Tfg is used to filter out 3D line samples which do not have a sufficient average foreground coverage rate (AFCR), as shown in Fig. 4.2(c)¹⁰.

9 The interpolation spacing between two adjacent sample points corresponds to a total number of Nplane equally spaced reference planes between the ground plane and the plane at a height of 250 cm.

10 In our implementation, each sample point of a 3D line sample is projected to all views to check if it is covered by foreground for the computation of AFCR. For example, AFCR for each of the green axes shown in Fig. 2.4 is equal to 100% with respect to all (three) views.
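A minimal sketch of this verification stage is given below. The thresholds Ttl, Tth, and Tfg, the project() helper, and the reading of AFCR as a per-view coverage rate averaged over all views are illustrative assumptions rather than the exact definitions used in the implementation.

```python
import numpy as np

def average_fg_coverage(sample_points, fg_masks, projectors):
    """For each view, compute the fraction of sample points whose projections
    land on foreground pixels; return the average over views (assumed AFCR)."""
    rates = []
    for fg_mask, project in zip(fg_masks, projectors):
        hits = 0
        for p in sample_points:
            u, v = project(p)
            u, v = int(round(u)), int(round(v))
            if 0 <= v < fg_mask.shape[0] and 0 <= u < fg_mask.shape[1] and fg_mask[v, u] > 0:
                hits += 1
        rates.append(hits / len(sample_points))
    return float(np.mean(rates))

def passes_height_rules(top_height, T_tl, T_th):
    """Rule 3 (checked before refinement): the top end is lower than Ttl.
    Rule 4 (checked after refinement): the top end is higher than Tth.
    Combined here into a single band check for brevity."""
    return T_th < top_height < T_tl
```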


Fig. 4.2. Filtering results of the input images shown in Figs. 2.6(a)-(d). (a) The unverified 3D line samples which survive Rules 1-3; (b) the refined line samples which survive Rules 1-4; (c) the final line samples (see text).

4.3 Early screening for line correspondence

In this section, we propose a line correspondence measure for 2D line segments in two different views, based on a formulation of the cross ratio. Such a quantitative measure is view-invariant and can handle line segments of arbitrary configuration in the 3D scene. It will be applied to the people localization method described in Section 4.1 to filter out non-corresponding line sample pairs before 3D reconstruction, so that the computation speed of the proposed people localization can be further improved. We also convert the formulation to a more efficient form for computational efficiency. While such a measure is first illustrated via the concept of 3D reconstruction, as shown in Fig. 4.3(a), for a better understanding of the basic idea, we will show that the measure can actually be computed in either one of the two views.

4.3.1 A view-invariant measure of line correspondence

Assume we have a pair of line samples L1 and L2 in View 1 and View 2, respectively, and that the homographic matrices H1 and H2 between the two views and the ground plane π can be obtained from camera calibration. By projecting the line samples onto plane π, points A, B, C, and D can be obtained along a line in 3D space reconstructed by intersecting two planes, each containing a camera center and the corresponding projected line sample. The lengths of AB and CD should be very small if the two line samples correspond to the same 3D line segment. If L2 is projected to View 1 (as B′ and D′ in Fig. 4.3(b)), where A and C are the end points of the line sample obtained in View 1, B and D can be calculated as the intersection points of OB′ and OD′ with the line containing AC, respectively, with O being the camera center of View 2, which is found in advance.

Fig. 4.3. (a) Illustration of the basic idea of the proposed correspondence measure of two line features (samples). (b) Illustration of a general form of the view-invariant cross ratio.

Instead of using the above lengths, whose values vary with the viewpoint, the view-invariant cross ratio, in one of several forms as discussed in [29], can be used to evaluate the degree of line correspondence as

CR = [S(O, A, B) · S(O, C, D)] / [S(O, B, C) · S(O, A, D)], (3.1)

wherein each of the four terms S(O, ·, ·) represents a signed triangular area in Fig. 4.3(b). If L1 and L2 correspond to a perfect match, points A and B (and points C and D) will coincide, and CR = 0.

Furthermore, (3.1) can be calculated more efficiently as

CR = [S(O, A, B′) · S(O, C, D′)] / [S(O, B′, C) · S(O, A, D′)], (3.3)

since there is no need to compute B (D) from B′ (D′). Thus, the proposed view-invariant measure of line correspondence, with a zero value representing a perfect match¹¹, can actually be evaluated in either one of the two views by first computing the homographic transform, e.g., H1⁻¹H2 for View 1 in Fig. 4.3(a), of the two end points of a candidate line segment in the other view.

11 Values other than zero, as well as some special configurations of the above four points, will be considered in the next subsection.


4.3.2 Applying the line correspondence measure to improve the efficiency of people localization

In this subsection, we apply the proposed line correspondence measure to improve the efficiency of the people localization method of Section 4.1¹². Instead of finding correspondences of actual line features in the scene, we verify whether 2D line samples from different views belong to the same person. Thus, computations associated with a 3D line sample which clearly results from two line samples of different persons can be avoided. Such computations include (i) 3D reconstruction of the line sample, as mentioned in Section 4.1, (ii) 3D validation, and (iii) 2D (foreground) consistency checking of the 3D line sample. For example, physical properties of a human body can be used to validate the heights of B and C, and the length of BC in Fig. 4.3(a) for (ii). As for (iii), if a person does exist in the scene, the image of the person should be covered by some foreground regions in all views, so points on each 3D line sample are back-projected to all views for further verification. While the complexity of (ii) is very low once (i) is done, (iii) is very expensive since each back-projection requires the computation of a homographic transformation.

Fig. 4.4 shows the procedure of determining whether two line samples obtained from two different views are likely to represent the same person using various parts of (3.3). First, if the denominator of (3.3) is not greater than zero, i.e.,

S(O, B′, C) · S(O, A, D′) ≤ 0, (3.4)

the 3D segment reconstructed from the two line samples will have zero length. Thus, we can conclude that the samples belong to different persons. Except for the special cases, which seldom occur in practice, in which one or both ends of the two 2D line samples are reconstructed coincidentally so that the numerator of (3.3) is equal to zero¹³, (3.3) can be evaluated numerically to determine whether the reconstructed 3D line sample may result from the same person and thus whether further refinements and verifications, e.g., (ii) and (iii), are needed.
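For concreteness, the screening step can be sketched as below, using the signed-area form of (3.3)-(3.4) reconstructed above; the exact sign conventions and the sample coordinates are illustrative assumptions. Here A and C are the end points of the line sample in one view, Bp and Dp the other view's end points mapped into the same plane, and O the image of the other camera's center.

```python
def signed_area(p, q, r):
    """Twice the signed area of triangle (p, q, r); the sign encodes orientation."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def screen_line_pair(O, A, C, Bp, Dp, cr_threshold=0.01):
    """Early screening of a pair of 2D line samples (cf. Fig. 4.4).
    Returns (keep, CR); keep is False when the denominator test (3.4) fails or
    when |CR| exceeds the threshold, so the expensive 3D steps can be skipped."""
    den = signed_area(O, Bp, C) * signed_area(O, A, Dp)
    if den <= 0:  # condition (3.4): the reconstruction degenerates to zero length
        return False, None
    cr = signed_area(O, A, Bp) * signed_area(O, C, Dp) / den
    return abs(cr) <= cr_threshold, cr

# Made-up example: a nearly matching pair of line samples yields a tiny |CR|.
keep, cr = screen_line_pair(O=(100.0, 20.0), A=(0.0, 0.0), C=(0.0, 40.0),
                            Bp=(1.0, 1.0), Dp=(-1.0, 39.0))
```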

Fig. 4.5 shows two numeric examples of the proposed line correspondence measure for some line samples shown in Figs. 2.6(b) and (d). While a small value (0.0034) is obtained for Fig. 4.5(a), where the two line samples correspond to the same person, a larger value (0.0096) is obtained for Fig. 4.5(b) because of occlusion. A threshold of 0.01 is used in the experiments considered next to determine whether |CR| is small enough.

12 This is also true for the approach described in Chapter 2.

13 It is easy to see that in either case, which hardly occurs in practice, additional views are still needed to refine and verify the reconstructed 3D line sample.


Fig. 4.4. Procedure to determine whether two line samples are likely to represent the same person.

Fig. 4.5. Illustration of numerical values of the proposed line correspondence measure (see text).


4.4 Experiments

In the following, we present the experiments for the improvements described in Sections 4.1 and 4.2.

4.4.1 Applying the improvements described in Sections 4.1 and 4.2

In this subsection, the improvements described in Sections 4.1 and 4.2 are evaluated with several different videos taken from both indoor and outdoor scenes, with different degrees of occlusion. Comparisons with [25] and [26] are also included to show that the proposed method achieves comparable correctness/accuracy in localization but with much higher computation speed. Additionally, we investigate the performance of the proposed method with different numbers of cameras and densities of line samples in an image.

4.4.1.1 Experiments for different degrees of occlusion with indoor/outdoor sequences

The performance evaluation is carried out under Windows 7 with 4 GB RAM and a 2.4 GHz Intel Core2 Duo CPU, without using any additional hardware. Table 4.1 summarizes detailed localization results of the proposed 3D line reconstruction method as well as three other methods.

In addition to our previous work [26], a modified version¹⁴ of the approach proposed in [25] is also implemented and tested. The proposed approach achieves the highest recall rates for S1-S3, while the other three methods achieve the highest precision rates for the three video sequences.

Similarly, the differences in the accuracy of the derived people locations among these methods are very small (within 0.65 cm), except for the method proposed in Chapter 3. One can see that the 3D line reconstruction method achieves higher recall and precision rates (+3%) than the method described in Chapter 3 for S3. Overall, the mean value and standard deviation of the (x-y) location errors of the proposed method for S1-S3, taken together, are 10.70 cm and 5.90 cm, respectively, which can hopefully be regarded as sufficient for many surveillance applications¹⁵.

As for the computational speed, in frames per second (FPS), the values for the different cases listed in Table 4.1 are evaluated without including the cost of foreground segmentation.

14 In our implementation, which also does not perform people tracking, binary images of foregrounds are adopted as system input, as for the other two algorithms. A grid size of 100 × 100 is chosen for each of the twenty reference planes, with 10 cm grid spacing. A grid point on the ground is regarded as occupied if more than Tacc = 11 grid points with the same horizontal coordinates (but on reference planes of different heights) correspond to image foreground in all (4) views. Then, connected component analysis is applied to identify connected occupancy regions. The connected occupancy regions with very small areas, i.e., smaller than 22% of the average area of such regions, are regarded as noise and are removed.

15 The errors are only calculated for correctly detected people locations, which contribute to the precision rates listed in Table 3.1, i.e., with location errors less than 30cm.


Table 4.1. Method / Recall / Precision / Mean error (cm) / Frames per second.

One can see that a speed-up of more than an order of magnitude over the method in [25] can be achieved by the proposed approach, with as much as a 70-fold acceleration for S1 (nearly 2.7 times faster than our previous approach in [26]). While real-time performance can be achieved for S1 and S2, the computation speed drops to a near real-time 21.61 FPS when the number of people increases to twelve¹⁶. Note that for [25] the computation times (in FPS) are about the same for the different cases. This is because the time complexity of the generation of synergy maps mainly depends on the size of each image frame and the total number of views. In addition, the computation speed of the 3D line construction method is quite similar to that of the method described in Chapter 3. However, the 3D line construction method can achieve higher recall and precision rates when the scene is more crowded, i.e., for S2 and S3.

Although the above evaluations show that the proposed method can often provide reasonable

