
Chapter 1 Introduction

1.3 Organization of the thesis

The remainder of this thesis is organized as follows. In Chapter 2, people localization via vanishing points of vertical lines and multiple homographic matrices is proposed. The vanishing points are used to generate 2D line samples of foreground regions in multiple views. Potential people locations are found by projecting each pair of 2D line samples from different views onto reference planes of different heights via homographic matrices. The intersection points are then connected to form 3D line samples. After that, the 3D line samples are checked against the foreground regions of all views and grouped to locate people. Instead of reconstruction in the 3D space, we propose a grid-based approach in Chapter 3 to efficiently find potential people locations on the ground. We then generate 3D sample lines for these potential people locations, refine their two ends, and remove those not covered by enough foreground pixels in all views. Additionally, people heights are estimated from the 3D line samples as by-products. In Chapter 4, a more efficient reconstruction method is proposed to improve the people localization approach described in Chapter 2, in which projecting 2D line samples onto multiple reference planes to reconstruct 3D line samples takes considerable computation time.

1 For example, no additional image processing procedures are performed to identify each individual from a crowd, e.g., through connected component analysis and principal axis analysis as adopted in [21].


The improved approach reconstructs a 3D line sample as the intersection of two vertical triangles. In addition, a pre-filtering procedure using a view-invariant measure of line correspondence is introduced to further improve the efficiency. In Chapter 5, we first review an error analysis method for a pointing system. The idea is then extended and applied to our people localization method described in Chapter 3 to increase the accuracy of localization. Chapter 6 summarizes this thesis.


Chapter 2

Vanishing point-based line sampling for efficient people localization

In this chapter, vanishing point-based line sampling is introduced to increase the computation speed of people localization. The vanishing points of vertical scene lines in images captured from different viewing angles are used to generate 2D line samples of foreground regions.

Subsequently, 3D line samples of persons can be found efficiently via 3D reconstruction from pairs of 2D line samples in stereo views, avoiding the pixel-based operations suggested in [23-25].

2.1 Construction of major axes for non-occluded persons from a pair of views

For a better understanding of the basic ideas of the proposed localization, we begin by illustrating how to localize people using the major axes (MAs) of the foreground regions in 2D images. Assume the foreground regions of different persons do not overlap in a pair of views, so that the major axis of each person can be estimated correctly. By projecting these axes, instead of all foreground pixels as in [25], onto multiple reference planes parallel to the ground plane, a 3D axis can be formed for each person by connecting corresponding intersection points of the projected 2D axes on these reference planes. Furthermore, a more efficient scheme is introduced to find the above 3D axis by calculating the intersection line segment of two triangles in the 3D space if the camera centers can be estimated in advance, as sketched below.
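The following is a minimal sketch of this triangle-based idea, not the thesis implementation: each triangle, spanned by a camera center and the ground trace of a projected 2D axis, defines a plane (vertical, since the trace passes under the camera), and the person's 3D axis lies on the intersection line of the two planes. All names and numbers below are illustrative assumptions.

```python
import numpy as np

def plane_from_triangle(cam_center, g1, g2):
    """Plane n.x + d = 0 through the camera center and the ground trace
    (g1, g2) of a projected 2D axis; since the trace passes under the
    camera, the triangle (and hence the plane) is vertical."""
    n = np.cross(g1 - cam_center, g2 - cam_center)
    n /= np.linalg.norm(n)
    return n, -n.dot(cam_center)

def plane_intersection_line(p1, p2):
    """A point and the unit direction of the line where two planes meet."""
    (n1, d1), (n2, d2) = p1, p2
    direction = np.cross(n1, n2)
    A = np.stack([n1, n2, direction])   # third row fixes one point on the line
    point = np.linalg.solve(A, np.array([-d1, -d2, 0.0]))
    return point, direction / np.linalg.norm(direction)

# Hypothetical setup: two cameras at height 3 m; each ground trace runs from
# the camera's ground footprint through the person's foot point (1, 1, 0).
C1, C2 = np.array([0., -5., 3.]), np.array([5., 0., 3.])
t1 = plane_from_triangle(C1, np.array([1., 1., 0.]), np.array([2., 7., 0.]))
t2 = plane_from_triangle(C2, np.array([1., 1., 0.]), np.array([-3., 2., 0.]))
point, direction = plane_intersection_line(t1, t2)  # (1, 1, 0) and (0, 0, 1)
```

In this consistent configuration the recovered line passes through the person's foot point and is vertical, as expected for an upright person.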

2.1.1 Major axis estimation for a person in an image

In order to segment the foreground region of a person in an image, a Gaussian mixture model (GMM) [27], [28] can be applied. Assuming a region R obtained from foreground segmentation covers a large percentage of a person, we can estimate the major axis of the person by principal component analysis (PCA). An example of an axis estimated this way is shown in Fig. 2.1. One can see that the estimated major axis represents the elongated shape of a person very well.
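A minimal sketch of this PCA step, assuming a binary foreground mask is available from the GMM segmentation:

```python
import numpy as np

def major_axis(mask):
    """Return (mean point, unit direction) of the dominant axis of a 0/1 mask."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    mean = pts.mean(axis=0)
    cov = np.cov((pts - mean).T)           # 2x2 covariance of pixel coordinates
    eigvals, eigvecs = np.linalg.eigh(cov)
    return mean, eigvecs[:, np.argmax(eigvals)]  # largest-eigenvalue direction

# A tall synthetic blob: the recovered axis direction is (0, +-1), i.e. vertical.
mask = np.zeros((40, 20), dtype=bool)
mask[5:35, 8:12] = True
center, direction = major_axis(mask)
```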

2.1.2 Finding a 3D major axis of a person – two approaches

As shown in Fig. 2.2, let L1 and L2 be the axes of a person obtained by PCA for View 1 and View 2, respectively. In addition, let P12 be the intersection point of the two lines containing the projections of L1 and L2, respectively, onto the reference (ground) plane π from camera centers C1 and C2.

Fig. 2.1. Detected foreground regions and the estimated axis.

Fig. 2.2. Finding intersection points of two axes on a reference plane.

Ideally, for reference planes of different heights, such intersection points will either (i) belong to both projected axes, or (ii) stay away from at least one of them if the corresponding heights are out of the range of the 3D axis. Fig. 2.3 shows samples of the 3D axis thus obtained for the person shown in Fig. 2.1. While intersection points satisfying (i) are colored in black, points not satisfying (i), including those contained in one but not both projected axes due to computation errors, are marked in red.²

The above results provide an important cue for the estimation of a person's height. Additionally, one can see that the 2D (horizontal) positions of these 3D points are consistent enough that a roughly vertical major axis (MA) of the person can be constructed by connecting the black points, i.e.,

\[
\mathrm{Axis\_set}_{1,2} = \bigl\{\, P_{1,2}^{\,h} \;\big|\; h = h_b, \ldots, h_t \,\bigr\}, \tag{2.1}
\]

with \(h_b\) and \(h_t\) being the heights of the bottom and top end points of the axis, respectively.

2 To find the above intersection points on reference planes of different heights, a method to produce multiple homographic matrices is introduced, which can establish these matrices using only two marker points on each of the four calibrating pillars standing vertically on the ground plane. Details can be found in Appendix A.
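As an illustration of the projection-and-intersection step behind (2.1), the following sketch computes one intersection point P12 on a single reference plane; H1 and H2 stand for the (assumed known) homographies from the two image views to that plane, and repeating the computation for each height yields the axis set. This is a sketch of the geometry, not the thesis code, and it assumes the projected lines are not parallel.

```python
import numpy as np

def to_h(p):
    return np.array([p[0], p[1], 1.0])            # Euclidean -> homogeneous

def axis_intersection(H1, p1, q1, H2, p2, q2):
    """Intersection on one reference plane of two projected 2D axes,
    each axis given by its two image endpoints."""
    l1 = np.cross(H1 @ to_h(p1), H1 @ to_h(q1))   # projected line from View 1
    l2 = np.cross(H2 @ to_h(p2), H2 @ to_h(q2))   # projected line from View 2
    x = np.cross(l1, l2)                          # homogeneous intersection
    return x[:2] / x[2]

# Building the axis set of Eq. (2.1), one intersection per reference plane,
# with a hypothetical mapping from height h to that plane's homographies:
# axis_set = {h: axis_intersection(H1h, p1, q1, H2h, p2, q2)
#             for h, (H1h, H2h) in H_by_height.items()}
```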


Fig. 2.3. The axis samples of the person shown in Fig. 2.1, reconstructed for reference (horizontal) planes with 4 cm spacing and up to 176 cm in height.

2.1.3 Extension of finding 3D major axes for multiple non-occluded persons from a pair of views

The above method can be extended to estimate 3D MAs for multiple people if an axis can be found for each of them in two different views. Without knowing the correspondence of the axes in the two views, candidate 3D MAs can be constructed for all possible 2D MA pairs. For example, for M persons in View 1 and N persons in View 2, a total of MN candidate MAs can be constructed. Although we do not have correspondences of different people in these two views, it is possible to remove incorrect 3D MAs by checking the consistency of the foreground coverage, as will be explained in Subsection 2.2.1, with additional views. For example, while the two green axes in Fig. 2.4 are correct 3D MAs, the gray axis can be identified as an invalid axis from View 3.³

3 In general, an incorrect MA constructed from a pair of triangles can be removed by checking its consistency with an additional view point (in the 3D space), except for view points that are coplanar (in a 2D subspace) with one of the two triangles mentioned above. Since such exceptional view points occur with probability zero, incorrect MAs can be removed completely with the help of an additional camera.


Fig. 2.4. Illustration of filtering out incorrect 3D MAs by using an extra view.

Fig. 2.5. An example of overlapping foreground regions and the estimated axis.

2.2 Construction of major axes for multiple persons with occlusion

The above 2D PCA-based axis estimation can only cope with situations in which the foreground of a person is separable from others' in all views and can be identified as one region by connected component analysis. However, in real applications, many people may appear in a monitored scene at the same time, so that a segmented foreground area may contain more than one person, as shown in Fig. 2.5, and the aforementioned axis detection approach will not work correctly. One possible solution, proposed in [21], is to separate persons by projecting the foreground in the vertical direction to form a histogram, and then determining the boundaries between persons based on the locations of peaks and valleys in the histogram, before each person is represented by one axis for localization and tracking. However, the above approach may not work well when a very dense group of people appears in the scene, e.g., for the case shown in Fig. 2.6. For such more complicated situations, instead of estimating a 2D axis for each person, a 3D sampling scheme is proposed in this section wherein 2D line samples of the foreground regions from multiple views are used to generate 3D line samples of the


Fig. 2.6. (a)-(d) 2D line samples in Views 1-4. (e) The unverified 3D line samples which survive Rules 1-2. (f) The results of filtering and grouping.

foreground “volume”, based on the same idea described in Section 2.1. Then, with noise filtered out, these 3D line samples are verified with respect to different views by a back-projection procedure. Finally, a grouping algorithm is applied to the remaining samples in the scene, before the members of each group are integrated into a 3D MA.

2.2.1 Generating 3D line samples using vanishing points

Since the upper bodies of people are almost always perpendicular to the ground plane when they are standing and walking in a monitored scene, we first generate 2D line samples in each view which originate from the vanishing point of vertical lines in the 3D scene (see Figs. 2.6(a)-(d)).⁴ These 2D line samples correspond to a fan of vertical sampling slices in the 3D space, originating from the vertical line containing the corresponding camera center. Note that generating 2D line samples is much faster than the axis estimation discussed in Section 2.1 since no additional image processing is required.

4 The vanishing point in each view can be estimated by calculating the intersection points of the four lines extended from the four upright pillars mentioned in Subsection 2.1.2.


2D sampling lines shorter than a threshold Tp will be discarded, since they are expected to be far away from a major axis and will contribute little to the estimation of a 3D MA. A sketch of this sampling step is given below.
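The following is a minimal sketch of the sampling step, assuming a binary foreground mask, the vanishing point vp (a float numpy array, possibly outside the image), and a set of fan angles are given; the step size and parameter values are illustrative assumptions.

```python
import numpy as np

def line_samples(mask, vp, angles, t_p=20.0, step=1.0):
    """2D line samples of a foreground mask along rays fanned out from the
    vertical-line vanishing point vp; foreground runs shorter than t_p
    pixels are dropped, as described above."""
    h, w = mask.shape
    # Scan each ray from vp out to the farthest image corner.
    t_max = max(np.hypot(cx - vp[0], cy - vp[1])
                for cx, cy in [(0, 0), (w, 0), (0, h), (w, h)])
    samples = []
    for a in angles:
        d = np.array([np.cos(a), np.sin(a)])
        hits = []
        for t in np.arange(0.0, t_max, step):
            x, y = (vp + t * d).astype(int)
            if 0 <= x < w and 0 <= y < h and mask[y, x]:
                hits.append(t)
        if hits and hits[-1] - hits[0] >= t_p:
            samples.append((vp + hits[0] * d, vp + hits[-1] * d))  # endpoints
    return samples
```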

Next, for each pair of views, the remaining 2D line samples are used to reconstruct 3D line samples by the scheme described in Section 2.1. Since there may still be incorrect 3D line samples, such as the gray one shown in Fig. 2.4, two geometric rules can be used to filter out 3D line samples that do not correctly represent a person in the 3D scene:

1) The length of a 3D line sample is shorter than Tlen.
2) The height of its bottom end point P^{h_b} is higher than Tb.

Fig. 2.6(e) shows the 3D line samples that passed the two rules, each adjusted slightly so that it is perpendicular to the ground plane.

After applying the above two filtering rules, we further verify the 3D line samples against the image foreground. To check the foreground coverage of a 3D line sample, we back-project its intersection points of different heights to all image views. For a person who does appear in the monitored scene, these back-projected points should be covered by some foreground regions. For example, if all back-projected points in all views for a 3D MA fall on foreground, its average foreground coverage rate (AFCR) is equal to 100%. A 3D line sample with an AFCR lower than Tfg will be removed. Fig. 2.6(f) illustrates the filtering results for the line samples shown in Fig. 2.6(e).
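A minimal sketch of the AFCR check, with a hypothetical per-view `project` callable standing in for the back-projection and an illustrative threshold value:

```python
def afcr(points3d, views, t_fg=0.8):
    """Average foreground coverage rate of one 3D line sample.
    views: list of (mask, project) pairs, where project maps a 3D point to
    integer pixel coordinates (x, y) in that view (hypothetical callable)."""
    covered = total = 0
    for mask, project in views:
        for p in points3d:
            x, y = project(p)
            if 0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]:
                covered += int(mask[y, x])
                total += 1
    rate = covered / max(total, 1)
    return rate, rate >= t_fg          # keep the sample iff rate >= Tfg
```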

2.2.2 Integration of 3D line samples to form 3D major axes

After the above verification procedure, the major axis of a person can be estimated from the remaining 3D line samples using a straightforward grouping algorithm.⁵ Specifically, if the 2D horizontal distance between two 3D line samples is smaller than a threshold Tc, an edge between them is established in an undirected graph. After that, we can easily find the connected components (3D line sample groups) in the graph. For example, Fig. 2.7(a) shows the input frame for Fig. 2.6(d), and Fig. 2.7(b) shows the undirected graph obtained by the above grouping algorithm, with green points representing the 3D line samples. To avoid false positives in the grouping, a group containing fewer 3D line samples than a threshold Nline will be removed.
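A minimal sketch of this grouping step, using union-find to extract the connected components (threshold values are illustrative):

```python
import numpy as np

def group_samples(ground_xy, t_c=0.3, n_line=4):
    """Connected components of the graph linking samples whose ground-plane
    distance is below t_c; groups smaller than n_line are discarded."""
    n = len(ground_xy)
    parent = list(range(n))                       # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(ground_xy[i] - ground_xy[j]) < t_c:
                parent[find(i)] = find(j)         # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= n_line]
```

Each person's horizontal position then follows as the mean of ground_xy over a group, matching the red stars described below.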

To locate individual persons, the horizontal position of each person can be estimated as the average, shown as red stars in Fig. 2.7(b), of the horizontal positions of the 3D line samples in the corresponding group.⁶ In Fig. 2.7(c) we show the synergy map obtained with a method modified from [25]. Instead of considering the foreground probability of all image pixels, only those inside foreground regions are taken into account. One can see that the distribution of each group matches the corresponding occupied region (red) in the map quite well, i.e., all red stars fall inside the occupied regions.

5 Details can be found in [46].

6 The heights of the top and bottom ends of a 3D major axis are assigned as the heights of the highest and lowest end points in the corresponding group, respectively.

Fig. 2.7. Grouping and localization results. (a) Input frame 532. (b) Grouping sets. (c) Accumulated synergy map of all reference planes.


2.3 Experiments

In order to evaluate our method, we used an indoor video with a resolution of 320 × 240. The spacing between adjacent reference planes (51 in total) was selected as 4 cm. In the video, six people walk along three edges of the tiles on the ground, so we can easily evaluate the performance of localization. In Figs. 2.8(a) and (b), bounding boxes with a fixed cross-section of 50 cm × 50 cm are back-projected to individual images with their heights obtained from the derived 3D MAs, which are shown on the right of the figures with bold lines. One can see that the six persons are well represented by these bounding boxes, and their locations match the specified tracks well. For a comparison of computation time with [25], the simulation is performed with an implementation in C on Windows 7 with 4 GB RAM and a 2.4 GHz Intel Core 2 Duo CPU. Fig. 2.9(a) shows the processing speed, in frames per second (FPS), of our method for different portions of the video, with intervals A to F corresponding to an increase from 1 to 6 persons in the scene, respectively.

Fig. 2.8. Localization results for frames 475 and 540.

Fig. 2.9. Processing speed (in frames per second) of (a) our method and (b) the generation of the accumulated synergy map from all reference planes.


One can see that the processing speed varies with the people count, and more than 2.790 FPS can be achieved when there are six people in the scene; the average is 5.365 FPS. Fig. 2.9(b) shows the FPS required for the generation of synergy maps, as proposed in [25], which varies much less with time and has an average value of 0.118 FPS. (Note that the CUDA implementation adopted in [25] is not used here.)

This is because the time complexity of the synergy map method mainly depends on the size of the whole image rather than just the foreground.

2.4 Summary

We proposed a method for people localization which obtains 2D line samples of foreground regions in each view, with each line originating from the vanishing point of vertical lines in the scene. Geometrically, a pair of line samples obtained from two different views corresponds to a vertical line in the scene. 3D point samples along such a vertical line can then be obtained by projecting the above 2D line samples and identifying their intersection points on reference planes of different heights, using homographic matrices, each associating an image with a reference plane.

Finally, the 3D MA of each person is estimated by grouping 3D line segments derived from point samples satisfying some location and shape constraints. Since the most time-consuming process, homographic projection, is performed on line samples instead of the whole image, the proposed approach can achieve near real-time performance with localization accuracy similar to that of [25].


Chapter 3

Acceleration of vanishing point-based line sampling scheme for people localization and height estimation via footstep analysis

In this chapter, the efficiency of the above line sample-based approach is further improved by considering only one reference (ground) plane and adopting a 3D line sampling scheme that avoids explicit 3D reconstruction. Fig. 3.1 illustrates the schematic diagram of the proposed framework. First, the preprocessing procedures of camera calibration and foreground segmentation are executed. Next, we generate lines originating from the vanishing point of vertical lines in the scene to sample the foreground objects (people) in each camera view, as in [26]. The line samples of foreground objects from all camera views are then projected onto the ground plane via homography, with regions crossed by a large number of projected sample lines identified as candidate people regions. We then generate (vertical) 3D sample lines for these candidate people regions, refine their two ends, and remove those not covered by enough foreground pixels in all views. Finally, the remaining 3D sample lines are grouped into individual axes to indicate people locations. Additionally, the height of each person can be estimated as a by-product.

3.1 Finding candidate people regions (blocks)

Following Fig. 3.1, we first generate 2D sample lines of foreground regions in each camera view, originating from the vanishing point. Sample lines containing very few foreground pixels are discarded since they contribute little to the subsequent localization. Then, the remaining sample lines are projected onto the ground plane via homography. It is easy to see that the more a region is crossed by the projected sample lines, the more likely the region contains a person. Thus, we discretize the ground plane into a grid of 50 cm × 50 cm blocks, each having about the area a standing person occupies, and count the number of crossing sample lines for each block.

However, the above line counts may be distributed across neighboring blocks, as shown in Fig. 3.2(a). Thus, we add a second grid, which has an offset of 25 cm in both the X and Y directions (on the ground plane) from the first one. Note that the second grid can have higher counts in some blocks for the above example, as shown in Fig. 3.2(b). After merging the two layers of grids, we retain the higher count for each quarter block, as illustrated in Fig. 3.2(c). Finally, the quarter blocks whose counts are greater than a threshold Tcn⁷ are identified as candidate people blocks (CPBs); see the sketch following the footnote below.

7 We set Tcn = 8, which means the block must be crossed by sample lines from at least two camera views.
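A minimal sketch of the two-layer counting scheme described above; the monitored-area extent, its anchoring at the origin, and the input format of the crossing points are assumptions made for illustration.

```python
import numpy as np

def cpb_mask(crossings, extent=10.0, cell=0.5, t_cn=8):
    """Mark candidate people blocks on 25 cm quarter blocks.
    crossings: ground-plane (x, y) points where projected sample lines pass;
    extent: assumed square monitored area (metres) anchored at the origin."""
    n = int(extent / cell)
    quarter = np.zeros((2 * n, 2 * n))
    for shifted in (False, True):       # layer 1, then the 25 cm-offset layer 2
        off = cell / 2 if shifted else 0.0
        counts = np.zeros((n + 1, n + 1))
        for x, y in crossings:
            i, j = int((x + off) / cell), int((y + off) / cell)
            if 0 <= i <= n and 0 <= j <= n:
                counts[i, j] += 1
        # Each 50 cm block covers four quarter blocks; keep the higher count.
        for i in range(n + 1):
            for j in range(n + 1):
                qi, qj = 2 * i - int(shifted), 2 * j - int(shifted)
                for a in (qi, qi + 1):
                    for b in (qj, qj + 1):
                        if 0 <= a < 2 * n and 0 <= b < 2 * n:
                            quarter[a, b] = max(quarter[a, b], counts[i, j])
    return quarter >= t_cn              # boolean CPB mask (Tcn, footnote 7)
```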


Fig. 3.1. Schematic diagram of the proposed people localization framework.


Fig. 3.2. Finding candidate people blocks (CPBs) by two-layered grids. (a) Layer 1 grid. (b) Layer 2 grid. (c) Merging the two-layered grids.

Fig. 3.3. Building and refining 3D virtual rods.

3.2 People localization and height estimation

In this section, to achieve the goal of people localization and height estimation, vertical line samples of the human body are generated for the above CPBs. These line samples are then refined with respect to the image foreground from different views, screened by some physical properties of the human body, and grouped into axes of individual persons. In particular, four equally spaced rods of 200 cm in height are established on each CPB, as shown in Fig. 3.3. For each rod, we back-project it onto each camera view and inwardly refine its top and bottom (C and D in Fig. 3.3, as well as C′ and D′ calculated using the view-invariant cross-ratio) until they are covered by a foreground region. For error tolerance, e.g., to cope with noise and occlusion, the intersection of all the refined 3D rods for each ground location from different camera views is adopted as the final line sample of a possible human body.
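A minimal sketch of refining one rod against a single view, with the back-projection abstracted as a hypothetical `project` callable; the 4 cm sampling step along the rod is an illustrative choice:

```python
import numpy as np

def refine_rod(ground_xy, mask, project, h_max=2.0, step=0.04):
    """Refine a 200 cm virtual rod against one view's foreground mask.
    project is a hypothetical callable mapping a 3D point (metres) to
    integer pixel coordinates (x, y) in this view."""
    covered = []
    for h in np.arange(0.0, h_max + 1e-9, step):
        x, y = project((ground_xy[0], ground_xy[1], h))
        if 0 <= y < mask.shape[0] and 0 <= x < mask.shape[1] and mask[y, x]:
            covered.append(h)
    if not covered:
        return None                    # rod not supported by this view
    return covered[0], covered[-1]     # refined bottom / top heights

# Across views, the intersection of the per-view results gives the final line
# sample: bottom = max of the refined bottoms, top = min of the refined tops.
```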

Based on the physical shape/size of a human body, we then apply the rules described in Subsection 2.2.1 to filter out incorrect 3D line samples obtained above. The grouping procedure described in Subsection 2.2.2 is also applied. Finally, for each group, the average location (maximum height) of the line samples is regarded as a person's location (height).

3.3 Experiments

To evaluate our methods under different degrees of occlusion, we captured several video sequences of indoor and outdoor scenes. For each scene, calibration pillars are placed vertically for the estimation of camera centers, vanishing points, and multiple homographic matrices (see Appendix A), and are then removed from the scene. These sequences are captured with different numbers and trajectories of people.


The computation is performed on a PC under Windows 7 with 4 GB RAM and a 2.4 GHz Intel Core 2 Duo CPU, without using any additional hardware.

Fig. 3.4 shows an instance of scenario S1 captured from four different viewing directions with a 360×240 image resolution. The average distance between the cameras and the monitored area is about 15 m. One can see that the lighting conditions are quite complicated: sunlight may come through the windows directly, and reflections from the floor can be seen clearly.

