
Chapter 2 Vanishing point-based line sampling for efficient people localization

2.4 Summary

We proposed a method for people localization which obtains 2D line samples of foreground regions in each view, with each line originating from the vanishing point of vertical lines in the scene. Geometrically, a pair of line samples obtained from two different views corresponds to a vertical line in the scene. 3D point samples along such a vertical line can then be obtained by projecting the above 2D line samples onto reference planes of different heights and identifying their intersection points, using homographic matrices, each associating an image with a reference plane.

Finally, the 3D MA of each person is estimated by grouping 3D line segments derived from point samples satisfying some location and shape constraints. Since the most time-consuming process, homographic projection, is performed for line samples instead of the whole image, the proposed approach can achieve near-real-time performance with localization accuracies similar to those in [25].


Chapter 3

Acceleration of vanishing point-based line sampling scheme for people localization and height estimation via footstep analysis

In this chapter, the efficiency of the above line sample-based approach is further improved by considering only one reference (ground) plane and adopting a 3D line sampling scheme, without performing full 3D reconstruction. Fig. 3.1 illustrates the schematic diagram of the proposed framework. First, the preprocessing procedures of camera calibration and foreground segmentation are executed. Next, we generate lines originating from the vanishing point of vertical lines in the scene to sample the foreground objects (people) in each camera view, as in [26]. The line samples of foreground objects from all camera views are then projected onto the ground plane via homography, with regions crossed by a large number of projected sample lines identified as candidate people regions. We then generate (vertical) 3D sample lines for these candidate people regions, refine their two ends, and remove those not covered by enough foreground pixels in all views. Finally, the remaining 3D sample lines are grouped into individual axes to indicate people locations. Additionally, the height of each person can also be estimated as a by-product.

3.1 Finding candidate people regions (blocks)

According to Fig. 3.1, we first generate 2D sample lines, originating from the vanishing point, of the foreground regions in each camera view. Sample lines containing very few foreground pixels are discarded since they contribute little to the subsequent localization process. Then, the remaining sample lines are projected onto the ground plane via homography. It is easy to see that the more a region is crossed by the projected sample lines, the more likely it is that the region contains a person. Thus, we discretize the ground plane into a grid of 50 cm × 50 cm blocks, each having approximately the area a standing person occupies, and count the number of crossing sample lines for each block.

However, the above line counts may be distributed across neighboring blocks, as shown in Fig. 3.2(a). Thus, we add a second grid, which has an offset of 25 cm in both the X and Y directions (on the ground plane) from the first one. Note that the second grid can have higher counts in some blocks for the above example, as shown in Fig. 3.2(b). After merging the two layers of grids, we retain the higher count for each quarter block, as illustrated in Fig. 3.2(c). Finally, the quarter blocks whose counts are greater than a threshold T_cn7 are identified as candidate people blocks (CPBs).

7 We set T_cn = 8, which means the block is crossed by sample lines from at least two camera views.
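To make the two-layered counting concrete, the following minimal sketch rasterizes the projected sample lines into the two offset grids and thresholds the merged quarter-block counts. The 50 cm block size, the 25 cm offset, and T_cn = 8 follow the text; the square monitored area, the dense per-segment sampling, and all function and parameter names are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def find_cpbs(projected_lines, area_size=1000.0, block=50.0, t_cn=8):
    """Find candidate people blocks (CPBs) from ground-plane line samples.

    projected_lines: list of ((x0, y0), (x1, y1)) segments on the ground
                     plane, in cm (already projected via homography).
    area_size:       side length of the square monitored area in cm
                     (an assumption for this sketch).
    """
    n = int(area_size // block)
    # Layer 0: the base 50 cm grid; layer 1: the grid offset by 25 cm.
    counts = np.zeros((2, n + 1, n + 1), dtype=int)

    for (x0, y0), (x1, y1) in projected_lines:
        # Densely sample each segment and mark every block it crosses
        # (simplified rasterization: one count per line per block).
        ts = np.linspace(0.0, 1.0, 64)
        xs, ys = x0 + ts * (x1 - x0), y0 + ts * (y1 - y0)
        for layer, off in ((0, 0.0), (1, block / 2)):
            bx = np.clip(((xs + off) // block).astype(int), 0, n)
            by = np.clip(((ys + off) // block).astype(int), 0, n)
            for b in set(zip(bx.tolist(), by.tolist())):
                counts[layer][b] += 1

    # Merge the two layers at quarter-block (25 cm) resolution, keeping
    # the higher of the two counts for each quarter block.
    quarter = np.zeros((2 * n, 2 * n), dtype=int)
    for qx in range(2 * n):
        for qy in range(2 * n):
            quarter[qx, qy] = max(counts[0, qx // 2, qy // 2],
                                  counts[1, (qx + 1) // 2, (qy + 1) // 2])

    # Quarter blocks crossed by more than T_cn sample lines are the CPBs;
    # return their lower-left corners in cm.
    return np.argwhere(quarter > t_cn) * (block / 2)
```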


Fig. 3.1. Schematic diagram of the proposed people localization framework.

Fig. 3.2. Finding candidate people blocks (CPBs) by two-layered grids. (a) Layer 1 grid. (b) Layer 2 grid. (c) Merging the two-layered grids.

Fig. 3.3. Building and refining 3D virtual rods.

3.2 People localization and height estimation

In this section, to achieve the goal of people localization and height estimation, vertical line samples of the human body are generated for the above CPBs. These line samples are then refined with respect to the image foreground of different views, screened by some physical properties of the human body, and grouped into axes of individual persons. In particular, four equally spaced rods of 200 cm in height are erected on each CPB, as shown in Fig. 3.3. For each rod, we back-project it onto each camera view and inwardly refine its top and bottom (C and D in Fig. 3.3, as well as C′ and D′ calculated using the view-invariant cross ratio) until they are covered by a foreground region. For error tolerance, e.g., to cope with noise and occlusion, the intersection of all the refined 3D rods of each ground location from different camera views is adopted as the final line sample of a possible human body.
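The following sketch illustrates the per-rod refinement described above under simplifying assumptions: each view is represented by a projection function (3D point in cm to pixel) and a boolean foreground mask, the rod is shrunk to the span of its foreground-covered sample points in each view, and the per-view results are intersected. The cross-ratio-based transfer of C′ and D′ is omitted; all names are illustrative.

```python
import numpy as np

def refine_rod(ground_xy, project_fns, fg_masks, h_max=200.0, step=5.0):
    """Refine one 200 cm vertical rod erected at ground position (x, y).

    project_fns: per-view callables mapping a 3D point (cm) to pixel (u, v),
                 assumed to come from camera calibration.
    fg_masks:    per-view boolean foreground masks.
    Returns (bottom, top) heights (cm) of the intersection of the refined
    rods over all views, or None if some view gives no foreground support.
    """
    heights = np.arange(0.0, h_max + step, step)
    bottoms, tops = [], []
    for project, mask in zip(project_fns, fg_masks):
        covered = []
        for z in heights:
            u, v = project((ground_xy[0], ground_xy[1], z))
            iu, iv = int(round(u)), int(round(v))
            inside = 0 <= iv < mask.shape[0] and 0 <= iu < mask.shape[1]
            covered.append(inside and bool(mask[iv, iu]))
        if not any(covered):
            return None                    # rod unsupported in this view
        idx = np.nonzero(covered)[0]
        bottoms.append(heights[idx[0]])    # bottom end refined upward
        tops.append(heights[idx[-1]])      # top end refined downward
    bottom, top = max(bottoms), min(tops)  # intersect the refined rods
    return (bottom, top) if top > bottom else None
```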

Based on the physical shape/size of a human body, we then apply the rules described in Subsection 2.2.1 to filter out incorrect 3D line samples obtained above. Also, the grouping procedure described in Subsection 2.2.2 is applied. Finally, for each group, the average location (maximum height) of the line samples is regarded as the person's location (height).

3.3 Experiments

To evaluate our methods under different degrees of occlusion, we captured several video sequences of indoor and outdoor scenes. For each scene, calibration pillars are placed vertically and then removed from the scene for the estimation of camera centers, vanishing points, and multiple homographic matrices (see Appendix A). These sequences are captured with different numbers and trajectories of people. The computation is performed on a PC running Windows 7 with 4 GB RAM and a 2.4 GHz Intel Core 2 Duo CPU, without using any additional hardware.

Fig. 3.4 shows an instance of scenario S1 captured from four different viewing directions with a 360×240 image resolution. The average distance between the cameras and the monitored area is about 15 m. One can see that the lighting conditions are quite complicated: sunlight may come through the windows directly, and reflections from the floor can be seen clearly. A total of 691 frames are captured for S1, wherein eight persons are walking around a ninth one standing near the center of the monitored area.

Figs. 3.5(a) and (b) show the 2D line samples generated for Fig. 3.4(b) and the reconstructed 3D MAs, viewed from a slightly higher elevation angle, respectively. In addition, for a closer examination of the correctness of the proposed people localization and height estimation scheme, bounding boxes with a fixed cross-section, and with their heights obtained from the derived 3D MAs, are back-projected to the captured images, as shown in Fig. 3.5(c) for the image shown in Fig. 3.4(b). One can see that these bounding boxes do overlay nicely with the corresponding individuals. The recall and precision rates for the whole sequence are evaluated as 96.3% and 95.9%, respectively.

Fig. 3.6 shows similar localization results for scenario S2, which has the same people count as S1, but the nine people walk randomly in the scene so that the occlusion among them becomes more serious. As a result, both the recall and precision rates decrease slightly.

To further examine the robustness of our method under serious occlusion, scenario S3 is evaluated, which is similar to S2 but has twelve persons randomly walking in the scene. Since the scene becomes more crowded and serious occlusion occurs more frequently, the foregrounds of different persons may easily merge into larger regions, as shown in Fig. 3.7(a). While satisfactory localization results are obtained in Figs. 3.7(b) and (c), the recall and precision rates for S3 decrease to 91.9% and 90.0%, respectively.

The performance of the people localization approach described in this chapter is presented in Table 3.1. The precision and recall rates in all three scenes are above 90%. Furthermore, the proposed approach achieves very high computational efficiency, even for the crowded scene S3, wherein the 12 persons can be located quite accurately at a high processing speed of about 100 fps.

For performance comparison, similar results of people localization obtained in [26] are listed in Table 3.2. One can see that the approach proposed in this chapter achieves precision and recall rates similar to those in [26]. However, the processing speed is enhanced (about 2.6 times faster than [26]) due to the use of 3D line samples, instead of reconstructing 3D major axes via computing pairwise intersections of sample lines of image foreground projected at different heights.


Fig. 3.4. An instance of scenario S1, captured from four different viewing directions ((a)-(d)).

Fig. 3.5. Localization results for scenario S1. (a) Segmented foreground regions and 2D line samples for Fig. 3.4(b). (b) 3D major axes representing different persons in the scene. (c) Localization results illustrated with bounding boxes.

Fig. 3.6. Localization results, similar to those shown in Fig. 3.5, for scenario S2.

Fig. 3.7. Localization results, similar to those shown in Fig. 3.5, for scenario S3.

Table 3.1. Performance of the proposed approach in this chapter.

Sequence    Recall    Precision    Avg. error    FPS
S1          96.3%     95.9%        12.16 cm      30.74 (0.47)
S2          95.2%     95.3%        10.94 cm      32.06 (0.52)
S3          91.9%     90.0%        11.32 cm      23.78 (0.41)


Table 3.2. Performance of people localization of [26].

Sequence    Recall    Precision    Avg. error    FPS
S1          92.0%     95.7%        11.60 cm      11.62 (1.008)
S2          94.9%     97.3%        10.00 cm      12.05 (1.201)
S3          93.3%     94.3%        10.28 cm      8.34 (1.025)

Fig. 3.8. Results of height estimation for S1.

Fig. 3.9. Results of height estimation for S2.

Fig. 3.10. Results of height estimation for S3.



The results of person height estimation for S1 are presented in Fig. 3.8, where red squares indicate the actual heights and blue dots represent the estimated heights, together with intervals of one standard deviation. One can see that the errors are less than 5 cm. Similar estimation results for S2 can be observed in Fig. 3.9. However, in Fig. 3.10, the height estimate of one person (P6) has an error of more than 10 cm, which may result from more serious occlusion.

3.4 Summary

We propose an efficient and effective approach for people localization using multiple cameras. Building on [26], we retain the advantage of vanishing point-based line sampling and develop a 3D line sampling scheme to estimate people locations, instead of reconstructing 3D major axes via computing pairwise intersections of the sample lines at different heights as in [26]. The computation cost is thus greatly reduced. In addition, an effective height estimation method is also proposed in this chapter. The experiments on crowded scenes with serious occlusions verify the effectiveness and efficiency of the proposed approaches.


Chapter 4

Enhancement of line-based people localization

In this chapter, enhancement of the efficiency of the people localization approach described in Chapter 2 (see also [26]) is considered. The three major improvements include (i) more efficient 3D reconstruction, (ii) more effective filtering of reconstructed 3D line samples, and (iii) the introduction of a view-invariant measure of line correspondence for early screening. While (i) and (ii) are direct improvements of the approach presented in Chapter 2, (iii) introduces a new way of measuring the correspondence between two line samples obtained in different views.

4.1 Efficient 3D line construction from intersection of two triangles

While the approach described in Chapter 2 takes considerable computation time to calculate intersection points on multiple reference planes, as shown in Fig. 4.1 (left), an equivalent reconstruction of the 3D axis can actually be obtained by intersecting the two triangles8, as shown in Fig. 4.1 (right). By adopting such a method, the computation time, which no longer depends on the number of intersection points (reference planes), is expected to decrease greatly. Axis points can then be estimated by direct sampling along the 3D axis if necessary.
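A minimal sketch of this simplified reconstruction is given below: each triangle is replaced by its supporting plane (through the camera center and the two ground-plane endpoints of the projected line sample), and the 3D axis is the intersection line of the two planes. Clipping the line to the overlap of the two triangles, which yields the actual axis segment, is omitted, and the function names are illustrative.

```python
import numpy as np

def plane_from_points(p0, p1, p2):
    """Plane through three 3D points (NumPy arrays), as (unit normal n,
    offset d) satisfying n . x = d."""
    n = np.cross(p1 - p0, p2 - p0)
    n = n / np.linalg.norm(n)
    return n, float(n @ p0)

def axis_from_two_views(cam1, a1, b1, cam2, a2, b2):
    """Intersect the supporting planes of the two triangles, where each
    triangle is spanned by a camera center (cam_i) and the two endpoints
    (a_i, b_i) of its projected line sample on the ground plane.

    Returns (point_on_line, unit_direction) of the 3D axis, or None when
    the two planes are (nearly) parallel.
    """
    n1, d1 = plane_from_points(cam1, a1, b1)
    n2, d2 = plane_from_points(cam2, a2, b2)
    direction = np.cross(n1, n2)
    norm = np.linalg.norm(direction)
    if norm < 1e-9:
        return None
    direction = direction / norm
    # One point on the line: solve n1.x = d1, n2.x = d2, direction.x = 0.
    A = np.stack([n1, n2, direction])
    point = np.linalg.solve(A, np.array([d1, d2, 0.0]))
    return point, direction
```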

4.2 Refinement and verification of reconstructed 3D line samples

Although the rules of geometric filtering adopted in Chapter 2 are low-cost and effective, more filtering rules may be included to reduce missed detections. Since the two ends of a 3D line sample reconstructed above may be inaccurate, e.g., due to noise, we propose a refinement procedure to improve their precision. Additionally, two new rules are added, one before and the other after the refinement procedure, to increase the computation speed. Thus, the entire filtering procedure becomes more precise and effective. In particular, the following new rule, together with Rules 1-2, will be applied to a line sample right after the 3D reconstruction:

3) The height of its top end point P_ht is lower than T_tl. Fig. 4.2(a) shows line samples which survive Rules 1-3.

The main objective of the above three rules is to preserve two kinds of 3D line samples which correspond to (i) the full length of a standing/walking person or (ii) the head and torso of a person without his/her feet.

8 The camera centers can be found in advance using at least two of the aforementioned four pillars.

Fig. 4.1. Illustration of the simplified 3D reconstruction.

By selecting appropriate thresholds, these three rules may also accommodate human activities such as jumping and squatting. In practice, these three rules can efficiently remove most of the inappropriate 3D line samples, e.g., 84% of the originally reconstructed 3D line samples for the above example. However, since each 3D line sample is reconstructed from observations of two views only, the top and bottom ends of each 3D line sample may not be very accurate in position. To deal with such a problem, a refinement procedure using information from additional views, as described next, is adopted to find more accurate positions of the two end points before further verification of the 3D line samples is performed.

Conceptually, the refinement scheme is based on the fact that if a 3D line sample corresponds to a real person in the scene, its image in all views should be covered by foreground regions. In other words, its top and bottom end points will be covered by some foreground regions in all views. If that is not the case, the 3D line sample should be shortened until it falls within foreground regions in all views. Specifically, for each 3D line sample, we can use equally spaced sample points between its two ends P_ht and P_hb to form axis samples {P_ht, ..., P_hb}9 (see (2.1) in Subsection 2.1.3). The refinement of the top end point corresponds to finding the first sample point below P_ht that is covered by some foreground region in every view. Similarly, the refinement of the bottom end point can be done by searching in the upward direction from P_hb.

After such a refinement (shrinking) procedure, Rules 1-3 can be applied again, together with another new rule,

4) The height of its top end point P_ht is higher than T_th,

to filter out inappropriate 3D line samples. One can see from Fig. 4.2(b) that rough people locations can be distinguished visually from the remaining 3D line samples. Finally, a threshold T_fg is used to filter out 3D line samples which do not have a sufficient average foreground coverage rate (AFCR), as shown in Fig. 4.2(c)10.

9 The interpolation spacing between two adjacent sample points corresponds to a total number of N_plane equally spaced reference planes between the ground plane and the plane at a height of 250 cm.

10 In our implementation, each sample point of a 3D line sample is projected to all views to check whether it is covered by foreground for the computation of AFCR. For example, the AFCR of each of the green axes shown in Fig. 2.4 is equal to 100% with respect to all (three) views.
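Following footnote 10, a possible computation of the AFCR is sketched below. The aggregation of per-view coverage by a simple mean is our assumption, as the text does not spell it out; the projection functions and masks are the same illustrative abstractions used earlier.

```python
import numpy as np

def afcr(sample_points_3d, project_fns, fg_masks):
    """Average foreground coverage rate (AFCR) of one 3D line sample.

    Each sample point is projected into every view and checked against that
    view's foreground mask (per footnote 10); the per-view coverage rates
    are then averaged (mean aggregation is an assumption of this sketch).
    """
    rates = []
    for project, mask in zip(project_fns, fg_masks):
        hits = 0
        for p in sample_points_3d:
            u, v = project(p)
            iu, iv = int(round(u)), int(round(v))
            if (0 <= iv < mask.shape[0] and 0 <= iu < mask.shape[1]
                    and mask[iv, iu]):
                hits += 1
        rates.append(hits / len(sample_points_3d))
    return float(np.mean(rates))

# A 3D line sample is retained only if afcr(...) is at least T_fg.
```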


Fig. 4.2. Filtering results of the input images shown in Figs. 2.6(a)-(d). (a) The unverified 3D line samples which survive Rules 1-3. (b) The refined line samples which survive Rules 1-4. (c) Final line samples (see text).

4.3 Early screening for line correspondence

In this section, we propose a line correspondence measure for 2D line segments in two different views based on a formulation of the cross ratio. Such a quantitative measure is view-invariant, can handle line segments of arbitrary configuration in the 3D scene, and will be applied to the people localization method described in Section 4.1 to filter out non-corresponding line sample pairs before 3D reconstruction. Therefore, the computation speed of the proposed people localization can be further improved. We also convert the formulation to a more efficient form for computational efficiency. While the measure is first illustrated via the concept of 3D reconstruction, as shown in Fig. 4.3(a), for a better understanding of the basic idea, we will show that the measure can actually be computed in either one of the two views.

4.3.1 A view-invariant measure of line correspondence

Assume we have a pair of line samples L1 and L2 in View 1 and View 2, respectively, and that the homographic matrices H1 and H2 between the two views and the ground plane π can be obtained from camera calibration. By projecting the line samples onto plane π, points A, B, C, and D can be obtained along a line in 3D space reconstructed by intersecting two planes, each containing a camera center and the corresponding projected line sample. The lengths of AB and CD should be very small if the two line samples correspond to the same 3D line segment. If L2 is projected to View 1 (as B′ and D′ in Fig. 4.3(b)), where A and C are the end points of the line sample obtained in View 1, then B and D can be calculated as the intersection points of OB′ and OD′ with the line containing AC, respectively, with O being the camera center of View 2, which is found in advance.


Fig. 4.3. (a) Illustration of the basic idea of the proposed correspondence measure of two line features (samples). (b) Illustration of a general form of the view-invariant cross ratio.


Instead of using the above lengths, whose values vary with the viewpoint, the view-invariant cross ratio, in one of several forms discussed in [29], can be used to evaluate the degree of line correspondence as

CR = \frac{S_{\triangle OAB} \cdot S_{\triangle OCD}}{S_{\triangle OBC} \cdot S_{\triangle OAD}},    (3.1)

wherein each of the four terms represents a signed triangular area in Fig. 4.3(b). If L1 and L2 correspond to a perfect match, points A and B (and points C and D) will coincide, and CR = 0.

Moreover, since B (D) lies on the ray OB′ (OD′), the corresponding scale factors cancel between the numerator and denominator, and (3.1) can be calculated more efficiently by

CR = \frac{S_{\triangle OAB'} \cdot S_{\triangle OCD'}}{S_{\triangle OB'C} \cdot S_{\triangle OAD'}},    (3.3)

since there is no need to compute B (D) from B′ (D′). Thus, the proposed view-invariant measure of line correspondence, with a zero value representing a perfect match11, can actually be evaluated in either one of the two views by first computing the homographic transform, e.g., H1^{-1}H2 for View 1 in Fig. 4.3(a), of the two end points of a candidate line segment in the other view.

11 Values other than zero, as well as some special configurations of the above four points, will be considered in the next subsection.
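Assuming the area-based forms (3.1) and (3.3) reconstructed above, the measure can be coded directly from 2D signed triangle areas evaluated in View 1. The point naming follows Fig. 4.3(b); this is a sketch under those assumptions, not the thesis implementation.

```python
def signed_area(o, p, q):
    """Signed area of the 2D triangle (o, p, q); points are (x, y) pairs."""
    return 0.5 * ((p[0] - o[0]) * (q[1] - o[1])
                  - (p[1] - o[1]) * (q[0] - o[0]))

def line_correspondence(o, a, c, b_prime, d_prime):
    """Evaluate CR of (3.3) directly in View 1, without computing B and D.

    o:                the camera center of View 2 (found in advance);
    a, c:             end points of the line sample in View 1;
    b_prime, d_prime: end points of the other view's line sample mapped
                      into View 1 via H1^{-1} H2.
    Returns CR, or None when the denominator test (3.4) already implies a
    zero-length reconstruction (samples of different persons).
    """
    num = signed_area(o, a, b_prime) * signed_area(o, c, d_prime)
    den = signed_area(o, b_prime, c) * signed_area(o, a, d_prime)
    if den <= 0.0:        # condition (3.4)
        return None
    return num / den
```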


4.3.2 Applying the line correspondence measure to improve the efficiency of people localization

In this subsection, we apply the proposed line correspondence measure to improve the efficiency of the people localization in Section 4.112. Instead of finding correspondences of actual line features in the scene, we verify whether 2D line samples from different views belong to the same person. Thus, computations associated with a 3D line sample which clearly results from two line samples of different persons can be avoided. Such computations include (i) the 3D reconstruction of 3D line samples, as mentioned in Section 4.1, (ii) 3D validations, and (iii) the 2D (foreground) consistency check of the 3D line sample. For example, physical properties of a human body can be used in (ii) to validate the heights of B and C, and the length of BC in Fig. 4.3(a). As for (iii), if a person does exist in the scene, the image of the person should be covered by some foreground regions in all views, so points on each 3D line sample are back-projected to all views for further verification. While the complexity of (ii) is very low once (i) is done, (iii) is very expensive since each back projection requires the computation of a homographic transformation.

Fig. 4.4 shows the procedure for determining whether two line samples obtained from two different views are likely to represent the same person using various parts of (3.3). First, if the denominator of (3.3) is not greater than zero, i.e.,

S_{\triangle OB'C} \cdot S_{\triangle OAD'} \le 0,    (3.4)

the reconstruction from the two line samples will have zero length. Thus, we can conclude that the samples belong to different persons. Except for the special cases, which seldom occur in practice, in which one end or both ends of the two 2D line samples coincide after reconstruction so that the numerator of (3.3) is equal to zero13, (3.3) can be evaluated numerically to determine whether the reconstructed 3D line sample may result from the same person, in which case further refinements and verifications, e.g., (ii) and (iii), are needed.

Fig. 4.5 shows two numeric examples of the proposed line correspondence measure for some line samples shown in Figs. 2.6(b) and (d). While a small value (0.0034) is obtained for Fig. 4.5(a), where the two line samples correspond to the same person, a larger value (0.0096) is obtained for Fig. 4.5(b) because of occlusion. A threshold of 0.01 is used in the experiments considered next to determine whether |CR| is small enough.
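Using the measure above, the early screening of Fig. 4.4 can then be phrased as a simple filter over candidate pairs with the 0.01 threshold; the pairing loop and names below are illustrative only.

```python
def screen_pairs(candidate_pairs, t_cr=0.01):
    """Keep only the line-sample pairs worth reconstructing in 3D.

    candidate_pairs yields (o, a, c, b_prime, d_prime) tuples as expected
    by line_correspondence() above; pairs failing (3.4) or with
    |CR| >= t_cr are discarded before any homographic back projection
    is spent on them.
    """
    survivors = []
    for o, a, c, b_prime, d_prime in candidate_pairs:
        cr = line_correspondence(o, a, c, b_prime, d_prime)
        if cr is not None and abs(cr) < t_cr:
            survivors.append((o, a, c, b_prime, d_prime))
    return survivors
```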

12 This is also true for the approach described in Chapter 2.

13 It is easy to see that in either case, which hardly occurs in practice, additional views are still needed to refine and verify the reconstructed 3D line sample.


Fig. 4.4. Procedure to determine whether two line samples are likely to represent the same person.

Fig. 4.5. Illustration of numerical values of the proposed line correspondence measure (see text).


4.4 Experiments

In the following, we show the experiments of improvements described in Sections 4.1 and
