

4.4.2 Applying the improvements described in Section 4.3

For the screening procedure shown in Fig. 4.4, the filtering results for S1-S3 are listed in Table 4.4. One can see that the first step, which evaluates (3.4) only, already filters out about 50% of all line sample pairs, and that only about 7% of the line sample pairs are reconstructed for further processing. The evaluation is performed on a PC with 4 GB RAM and a 2.4 GHz Intel i5 M520 CPU. As for the overall performance in people localization, one can see from Table 4.5 that while the recall, precision, and localization errors of the accelerated method are comparable to those in [26], the execution speed is more than three times that of [26] for all sequences. We also applied the line correspondence measure to improve the efficiency of the method described in Section 4.1.

Table 4.6 shows less acceleration than Table 4.5 because the computational cost of the remaining steps of the method described in Section 4.1 is much lower than that of the processing in [26].
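The screening pipeline of Fig. 4.4 can be summarized with a minimal sketch. Here `correspondence_measure` stands in for the test of (3.4) and `reconstruct_3d_line` for the triangle-intersection reconstruction; both names are hypothetical placeholders, as the actual implementations are not reproduced here.

```python
from itertools import product

def screen_and_reconstruct(lines_view_a, lines_view_b,
                           correspondence_measure, reconstruct_3d_line,
                           threshold=0.5):
    """Two-stage screening: score every cross-view pair with the cheap
    correspondence measure first, and run the expensive 3D reconstruction
    only for the pairs that survive the test."""
    reconstructed = []
    total = no_match = 0
    for la, lb in product(lines_view_a, lines_view_b):
        total += 1
        # Stage 1: cheap view-invariant measure; most pairs are rejected here.
        if correspondence_measure(la, lb) < threshold:
            no_match += 1
            continue
        # Stage 2: 3D reconstruction, only for the surviving pairs.
        reconstructed.append(reconstruct_3d_line(la, lb))
    return reconstructed, no_match / total
```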

22 For example, while localization-based people tracking is often needed in intruder detection and abnormal behavior detection, if such functions are to be implemented without special acceleration hardware, our approach has a better chance of fulfilling real-time performance requirements than the method presented in [25]. As another example, effective people tracking based on the localization results may need to be developed for similar applications, and may be more sophisticated than that presented in [25] while still being implemented without any special hardware.

23 To that end, the constraints for a human standing on the ground plane, which include Rules 2-4, should be removed.


Table 4.4. Filtering results of Fig. 4.4.

Sequence   Total pairs   No-match pairs   Reconstructed pairs
S1         8282914       52.3%            6.3%
S2         8550765       48.9%            7.7%
S3         4298920       48.2%            6.7%

Table 4.5. Localization results of sequences S1-S3.

Sequence   Method                          Recall   Precision   Mean error   FPS
S1         Lo [26]                         93.7%    95.1%       11.07        26.69
S1         Lo [49] (Lo [26] + Sec. 4.3)    94.8%    95.1%       11.05        83.46 (3.12x)
S2         Lo [26]                         94.6%    94.2%       9.57         26.33
S2         Lo [49] (Lo [26] + Sec. 4.3)    97.0%    93.1%       9.53         76.60 (2.90x)
S3         Lo [26]                         92.3%    91.9%       9.57         18.09
S3         Lo [49] (Lo [26] + Sec. 4.3)    91.7%    95.6%       9.87         63.79 (3.50x)

Table 4.6. Localization results of the method proposed in Sections 4.1, 4.2, and 4.3.

Sequence   Method                                   Recall    Precision   Mean error     FPS
S1         Lo [48] (Secs. 4.1 & 4.2)                96.50%    95.60%      11.42 (5.89)   127.80
S1         Lo [48, 49] (Secs. 4.1, 4.2, and 4.3)    93.83%    95.53%      11.25 (5.93)   186.91 (1.46x)
S2         Lo [48] (Secs. 4.1 & 4.2)                96.80%    97.00%      10.09 (5.77)   121.93
S2         Lo [48, 49] (Secs. 4.1, 4.2, and 4.3)    96.35%    96.69%      9.82 (5.57)    173.73 (1.42x)
S3         Lo [48] (Secs. 4.1 & 4.2)                95.20%    93.60%      10.55 (6.01)   86.71
S3         Lo [48, 49] (Secs. 4.1, 4.2, and 4.3)    93.60%    94.53%      10.59 (6.01)   140.12 (1.61x)

Fig. 4.11. A more challenging localization example for a busy street scene. (a)-(d) The localization results (illustrated with bounding boxes) of four views. (e)-(h) Corresponding foreground regions and 2D line samples. (i) 3D line samples representing different persons in the scene.


4.4.2.1 Summary of the experiments

A correspondence measure of 2D line segments in two different views is proposed. Such a measure can handle line segments of arbitrary configurations in the 3D scene and is view-invariant, i.e., the same measurement is obtained quantitatively from either one of a pair of views. Besides, we also propose a line-based people localization scheme that applies such a measure to improve the efficiency of the methods described in [26] and Section 4.1. By verifying whether 2D line samples from different views belong to the same person, computations associated with invalid 3D line samples, which result from different persons, can be avoided. Experiments are performed on videos of crowded scenes with various degrees of occlusion. Overall, people localization results comparable to those of the two localization methods, in terms of correctness and accuracy, can be obtained, but with a more than 1.42-fold increase in computation speed. Other applications of the proposed line correspondence measure, e.g., in robot SLAM, are currently under investigation.

4.5 Summary

In order to enhance the efficiency of [25], we proposed a vanishing point-based line sampling technique in [26] (Chapter 2). While the main idea of the approach presented in [25] is to project dense 2D samples (image pixels) onto multiple (horizontal) planar surfaces in the 3D space (before these data are fused into 3D object distributions), it is simplified in [26] by instead projecting 1D image samples24, i.e., lines passing through the vanishing point of vertical lines in the 3D space (before their intersections are grouped into 3D line samples of the crowd). To further improve the efficiency of people localization, a novel approach is proposed in this chapter which projects the above line samples directly into the 3D space, i.e., along a fan of vertical planes originating from the vertical axis containing the camera center, to generate possible 1D (vertical line) samples of the 3D object25. Since realistic constraints of a human body can be adopted to refine and to verify these object samples, localization results compatible with those in [25] can be achieved, but with a more than an order of magnitude improvement in processing speed. Furthermore, we proposed a view-invariant correspondence measure of line segments in different images in Section 4.3 to improve the efficiency of the method proposed in [26] and Section 4.1.

24 In the rest of this thesis, we will refer to these samples as 2D line samples.

25 In the rest of this thesis, we will refer to these samples as 3D line samples.


Chapter 5

Error analysis of 3D line reconstruction from the intersection of two triangles

In Chapter 3, we proposed a people localization method which is based on 3D line reconstruction from the intersection of two triangles and can achieve high recall and precision rates.

An empirical error analysis scheme for similar 3D line reconstruction is developed in this chapter for a simple pointing system. The related error analysis results are expected to further improve the accuracy of the people localization method mentioned above.

5.1 Motivation

For many HCI applications, the pointing directions of a user can be conveniently transformed into instructions, such as asking a robot to move to a desired position or controlling a computer via a virtual mouse. While real-time computation of the pointing direction (and its target) is often required, accuracy and stability of the computation are the most desirable attributes of such pointing systems.

In some pointing systems, human hands are exploited to give instructions via associated direction vectors. For example, the line from the finger root to the fingertip is recognized as the pointing direction in [30], while in [31] the pointing direction is defined by the line from head to hand. Similarly, one eye and one fingertip are considered to form a direction vector in [32], while a similar vector is established by connecting the shoulder to the arm in [33]. Instead of using skin color to detect the pointing direction of a human hand, as in [31, 32], motion analysis of feature points of the user's hand is adopted in [34] to estimate the shoulder point and the direction vector. In [35], a vision-based method is proposed to find pointing directions extended from head to hand; in addition, artificial neural networks are used to find the head orientation to improve the accuracy of the pointing results. In general, to locate the pointing position in a 3D environment, some form of 3D reconstruction needs to be carried out to determine the direction vector. In [32, 33, 35, 36], 3D voxels of a pair of feature points used in the pointing are calculated before such a vector is formed.

In order to study the accuracy and stability of pointing, a real-time, vision-based system similar to that presented in [31], but with the pointing direction specified by a pointer, is implemented (see Fig. 5.1). By considering the intersection of planes in the 3D world, the system first calculates two planes, each formed by the two endpoints of the pointer and the center of one of the two cameras.

The intersection of these two planes then forms the direction vector. Instead of explicitly deriving intrinsic and extrinsic camera parameters, the approach only needs the camera positions and the calibrated homographies, provided that the camera's deviation from ideal perspective projection (e.g., lens distortion) remains fixed.

Fig. 5.1. Configuration of the pointing system and the reconstruction of a pointing point.
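To make the two-plane construction concrete, the following is a minimal sketch of the plane-intersection step, assuming each plane is already determined by three 3D points (two ground-plane projections of the pointer endpoints and one camera center); the function names are illustrative, not from the thesis.

```python
import numpy as np

def plane_from_points(p1, p2, p3):
    """Plane through three 3D points, as (unit normal n, offset d) with n . x = d."""
    n = np.cross(p2 - p1, p3 - p1)
    n = n / np.linalg.norm(n)
    return n, n @ p1

def plane_intersection_line(n1, d1, n2, d2):
    """Intersection line of two planes, as (a point on the line, unit direction)."""
    direction = np.cross(n1, n2)              # parallel to both planes
    direction = direction / np.linalg.norm(direction)
    # Any point satisfying both plane equations lies on the line; take the
    # minimum-norm solution of the underdetermined 2x3 system.
    A = np.vstack([n1, n2])
    point, *_ = np.linalg.lstsq(A, np.array([d1, d2]), rcond=None)
    return point, direction
```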

For all pointing systems, different forms of measurement and computation errors can be generated during the reconstruction of the pointing line, which has five degrees of freedom, and a clear understanding of these errors may greatly improve the applicability of such systems.

However, existing error analysis schemes are mainly concerned with planar localization based on image data acquired by a single camera [37-41], or with the reconstruction of 3D point features using stereo cameras [42-45], which only have two or three degrees of freedom. In the following, an efficient error analysis scheme is established for an experimental pointing system by evaluating the error range of pointing results on a projection plane, e.g., a screen, with the image data assumed to be corrupted by additive noise, as in some of the above approaches. Hopefully, with the help of such analysis, more robust pointing results can be achieved by selecting the most appropriate pointer positions, or pairs of cameras, that result in a minimal range of pointing error; in addition, a more accurate people localization method can be achieved. While a pointer with a bright color is used here to greatly reduce the influence of certain sources of error, e.g., those due to errors in image feature extraction, the error analysis results will provide an upper bound of pointing accuracy for systems using different pointers, e.g., those discussed in [30-33, 35, 36], if a similar reconstruction process is adopted.


Fig. 5.2. Noise circles (simulated points) for the pointer endpoints located in the stereo images shown in Fig. 5.1, and their CICTs (see text).

5.2 An experimental pointing system

In this section, we describe the configuration of the simple experimental pointing system used in this thesis, which is similar to that of [31] but uses homographic transformations to derive the pointer direction. The system uses two cameras mounted on the ceiling, four reference points on the floor, and a projection plane perpendicular to the ground (see Fig. 5.1).26 A two-fold simplification is associated with such a pointing system. First, unlike [30] and [32], which use color and brightness to find the hand region, we use a pointer with a bright color to reduce the complexity of feature extraction. Second, unlike in [31] and [33], a simple camera calibration similar to that used in [30] is adopted for 3D reconstruction based on homographic transformations. With such a simplified system configuration, the errors generated during the reconstruction process can be studied more easily and understood more clearly.

In the proposed approach, the left and right images are acquired simultaneously from the two cameras. For each of the stereo images, the image pixels of the pointer are obtained through a preprocessing step (see Appendix), and a best-fit line of these pixels is calculated via principal component analysis (PCA). The line intersects the bounding box of the above image pixels at two points, which are then regarded as (extended) endpoints of the pointer in the image. In this section, the two sets of pointer endpoints are denoted as {ILS, ILE} and {IRS, IRE} for the left and right images, respectively.
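A minimal sketch of this endpoint extraction step is given below; it assumes the pointer pixels are already available as an (N, 2) array, and clips the PCA line to the pixels' axis-aligned bounding box with a standard slab test.

```python
import numpy as np

def pointer_endpoints(pixels):
    """PCA best-fit line of the pointer pixels, clipped to their bounding box.
    `pixels`: (N, 2) array of (x, y) image coordinates of the pointer region."""
    pixels = np.asarray(pixels, float)
    mean = pixels.mean(axis=0)
    # Principal direction = eigenvector of the covariance with the largest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(np.cov((pixels - mean).T))
    direction = eigvecs[:, np.argmax(eigvals)]
    # Clip the line mean + t * direction to the bounding box (slab test).
    lo, hi = pixels.min(axis=0), pixels.max(axis=0)
    t_enter, t_exit = -np.inf, np.inf
    for axis in (0, 1):
        if abs(direction[axis]) > 1e-12:
            t1 = (lo[axis] - mean[axis]) / direction[axis]
            t2 = (hi[axis] - mean[axis]) / direction[axis]
            t_enter = max(t_enter, min(t1, t2))
            t_exit = min(t_exit, max(t1, t2))
    return mean + t_enter * direction, mean + t_exit * direction
```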

26 The coordinates (in cm) of the two cameras CL and CR are (192, 365, 264) and (493, 122, 264), and the coordinates of the four corners of the projection plane are (115, 0, 243), (115, 0, 108), (295, 0, 108), and (295, 0, 243). In general, the pointing system can be used to identify a non-planar object at various locations in the 3D space. The projection plane is included here for demonstration purposes only.


Once the positions of the above endpoints are located in the left and right images, we use homographic transformations to find their projections, RLS, RLE, RRS, and RRE, on the ground plane, as shown in Fig. 5.1.27 Thus, plane πL, which contains RLS, RLE, and the center of the left camera CL, and plane πR, which contains RRS, RRE, and CR, can be reconstructed. Planes πL, πR, and the projection plane πP then intersect at the pointing position P. Finally, we transform P into the 2D coordinates of the monitor screen through another homographic transformation, and display the reconstructed pointing position (RPP).
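Putting the steps together, a compact sketch of the whole reconstruction is shown below, assuming the homographies HL and HR and the plane πP (as normal nP and offset dP) are given; the names follow the text, but the code is an illustrative reimplementation, not the thesis' own.

```python
import numpy as np

def ground_point(H, p):
    """Project image point p = (u, v) to the ground plane (z = 0) via homography H."""
    x = H @ np.array([p[0], p[1], 1.0])
    return np.array([x[0] / x[2], x[1] / x[2], 0.0])

def reconstruct_rpp(HL, HR, ILS, ILE, IRS, IRE, CL, CR, nP, dP):
    """Reconstructed pointing position P = piL ∩ piR ∩ piP (cf. Fig. 5.1)."""
    def plane(p1, p2, p3):                # plane through three points: n . x = d
        n = np.cross(p2 - p1, p3 - p1)
        return n, n @ p1
    nL, dL = plane(ground_point(HL, ILS), ground_point(HL, ILE), np.asarray(CL, float))
    nR, dR = plane(ground_point(HR, IRS), ground_point(HR, IRE), np.asarray(CR, float))
    # P satisfies all three plane equations; solve the 3x3 linear system.
    return np.linalg.solve(np.vstack([nL, nR, nP]), np.array([dL, dR, dP]))
```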

With the above simple reconstruction process (see Appendix), there is no need to find all camera parameters, as required in typical 3D reconstruction approaches, and the pointing system can operate efficiently in real time. However, noise in the imaging process may result in reconstruction errors and thus an unstable pointing position. To understand the influence of such undesirable effects, and hopefully to develop a scheme to reduce that influence, we divide the errors into (i) static and (ii) dynamic errors. Static errors such as digitization, lens distortion, and measurement errors are almost unavoidable. For example, when we determine the positions of the four reference points on the ground and image planes, for calculating the transformation matrix between the two planes, computation or measurement errors may occur. Such errors can be corrected by an additional homographic transformation, and may even be unnoticeable to a user in practice because of the self-adjusting ability resulting from the visual feedback during the pointing operation. However, dynamic errors may cause obvious jitters in the RPP, which are usually unacceptable. Thus, the error analysis discussed in this chapter will focus on (ii).

There are several sources of dynamic error, a major one being the noise associated with image acquisition. For example, pixels of the pointer region are identified in each of the stereo images before the PCA is performed; however, the size and shape of the region may change with time because of illumination changes and the influence of noise from the camera sensors28.

27 The corresponding transformations, HL (for ILX → RLX, X = S, E) and HR, are found in advance by using the positions of four reference points marked on the floor (not shown in Fig. 5.1) and their positions in the stereo images.

28 Influences from more complex situations, e.g., when the pointer's color is close to the background, are not considered in this chapter, since the highly dynamic segmentation errors of the pointer due to pointer-background interaction may be so large that an error analysis of the RPPs would be meaningless. (Similarly, extraction of the reference points in the system calibration stage is also assumed to be free of such complex situations.) In general, more involved segmentation schemes will be needed to resolve such a problem, which is beyond the scope of this chapter. One way of resolving the problem is to employ special hardware in the system setup, e.g., attaching blinking LEDs [24] to the pointer.


In the following, error analysis methods are developed to investigate the influence of dynamic errors on the RPPs of the proposed system. The goal is to identify, correctly and efficiently, the range of error in the position of the RPP.

For the pointing system shown in Fig. 5.1, πP, CL, and CR are fixed in position; therefore, the RPP is determined by the reconstructed planes πL and πR, and in turn by the pointer endpoints ILS, ILE, IRS, and IRE. The extraction of these points from the stereo images is often influenced by the imaging noise mentioned above. As a result, the obtained pointer endpoints are not stable, and neither is the calculated RPP. Thus, the deviation of the RPPs due to variations of ILS, ILE, IRS, and IRE will be the main focus of this chapter.

For a preliminary examination of the above deviation, simulated noises of unit magnitude are added to these pointer endpoints. In particular, 24 simulated points placed evenly (with 15º spacing) along "noise" circles with a radius of 1 pixel are generated for ILS = (188, 158), ILE = (247, 189), IRS = (159, 142), and IRE = (226, 155) in Fig. 5.1, as shown in Fig. 5.2. In each run of the simulation, four points, each selected from one of the above four circles, are taken as the endpoints of the pointer in the stereo images to reconstruct an RPP using the aforementioned homographic transformations. Fig. 5.3(a) shows all 24⁴ RPPs (in red), computed from the 24 × 4 simulated points, with their convex hull (the range of reconstruction errors) shown in Fig. 5.3(b).
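A minimal sketch of this exhaustive simulation is given below; `reconstruct` is any stand-in that maps the four perturbed image endpoints to a 2D RPP on the projection plane, and scipy is assumed to be available for the convex hull. The exhaustive 24⁴ enumeration (331,776 reconstructions) is exactly the cost that the CICT shortcut described below avoids.

```python
import numpy as np
from itertools import product
from scipy.spatial import ConvexHull

def noise_circle(center, radius=1.0, n=24):
    """n points evenly spaced (15 deg apart for n = 24) on a circle around center."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.asarray(center, float) + radius * np.column_stack([np.cos(t), np.sin(t)])

def rpp_error_range(reconstruct, endpoints, radius=1.0, n=24):
    """Convex hull of all n**4 RPPs obtained by picking one perturbed point
    from each endpoint's noise circle (endpoints = [ILS, ILE, IRS, IRE])."""
    circles = [noise_circle(e, radius, n) for e in endpoints]
    rpps = np.array([reconstruct(a, b, c, d) for a, b, c, d in product(*circles)])
    return rpps[ConvexHull(rpps).vertices]        # hull vertices = error range
```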

In general, it is desirable to have such a range calculated more efficiently, e.g., with fewer simulated endpoints of the pointer. However, a direct reduction in the data size may underestimate the range of reconstruction errors. For example, the blue region in Fig. 5.4 is obtained by using only 4 points (with 90º spacing) from each noise circle shown in Fig. 5.2.

From close examination of the relationship between the above reconstruction errors and the locations of the four simulated endpoints of the pointer obtained from Fig. 5.2, it is found that the error range is mainly due to the (two) extreme values in the slopes of ILS′ILE′ (and IRS′IRE′).

Based on such an observation, we then use only the contacts of the internal common tangents (CICTs) of the two noise circles in each of the stereo images (see Fig. 5.2 for such tangents). The range of reconstruction error thus obtained is also shown in Fig. 5.4 (as four points connected by black line segments). One can see that such results almost coincide with those obtained using all (24) points from each noise circle of simulated points shown in Fig. 5.2. A closer examination can be carried out by comparing the coordinates of the vertices shown in Fig. 5.4, as listed in Table 5.1. Thus, estimation of the error range from the larger number of simulated points (24 × 4) can be replaced by using only the 8 (2 × 4) CICTs with negligible change in the estimation, and with the number of reconstructed RPPs reduced greatly (from 24⁴ to 2⁴).
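For two noise circles of equal radius, the internal common tangents cross at the midpoint of the two centers, so the CICTs can be computed in closed form. The sketch below assumes this equal-radius configuration; e.g., cicts((188, 158), (247, 189)) gives the two contacts on each of the circles around ILS and ILE in the left image of Fig. 5.2.

```python
import numpy as np

def tangent_contacts(point, center, r):
    """Contact points of the two tangent lines from an external `point`
    to the circle (center, r)."""
    v = np.asarray(point, float) - np.asarray(center, float)
    d = np.linalg.norm(v)
    alpha = np.arccos(r / d)     # angle between center->point and center->contact
    base = np.arctan2(v[1], v[0])
    return [np.asarray(center, float) +
            r * np.array([np.cos(base + s), np.sin(base + s)])
            for s in (alpha, -alpha)]

def cicts(c1, c2, r=1.0):
    """Contacts of the internal common tangents (CICTs) of two equal-radius
    noise circles; the tangents cross at the midpoint of the centers."""
    m = 0.5 * (np.asarray(c1, float) + np.asarray(c2, float))
    return tangent_contacts(m, c1, r) + tangent_contacts(m, c2, r)
```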


Fig. 5.3. (a) RPPs for the simulated points shown in Fig. 5.2. (b) Range of reconstruction errors (with the error-free reconstruction shown by an "x").

Fig. 5.4. Error range shown in Fig. 5.3(b) (red), a similar range obtained by using only 4 points (with 90º spacing) from each noise circle in Fig. 5.2 (blue), and the error range based on the internal common tangents (black, see text).

Table 5.1. Coordinates of the vertices shown in Fig. 5.4.

          xmax       xmin       ymax       ymin
(blue)    535.9857   455.9600   395.6483   330.5393
(red)     539.2251   453.4022   397.8248   328.1907
(black)   539.2422   453.2395   397.8823   328.1027

The above observation regarding the CICTs of two noise circles, i.e., that the RPP of a pointer from stereo images is displaced much more when the pointer is rotated than when it is translated, for comparable amounts of movement of its endpoints, can be explained with a simple example, as discussed in the following. Consider a pointing system with a geometric configuration similar to that shown in Fig. 5.1, and assume the pointer is initially perpendicular to the projection plane.

When the pointer is translated by k in a direction parallel to the projection plane, the RPP will be translated by k too. However, if we fix the endpoint of the pointer which is farther from the projection plane as the center of rotation, and rotate the pointer by θ degrees such that the other end of the pointer is displaced by k = θr, with r being the length of the pointer, the RPP will have a displacement of k′ > θd, with d being the distance from the pointer to the projection plane. One can see that if d >> r, which is often the case in various pointing situations, the movement of the RPP due to a rotated pointer is much larger than that due to a translated pointer, or k′ >> k. This example reasonably explains why the estimated maximal error range (EMER) efficiently obtained using CICTs can represent the real error range with high accuracy, as the CICTs give the limits of the rotation angle of the pointer, with its endpoints confined to the two noise circles in each of the stereo images.
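A quick numeric check of this k′ >> k behavior, using hypothetical dimensions (a 30 cm pointer held 3 m from the projection plane) rather than the actual system geometry:

```python
import numpy as np

r, d = 30.0, 300.0           # hypothetical pointer length and distance (cm)
k = 1.0                      # endpoint displacement (cm)
theta = k / r                # rotation (rad) that moves the free endpoint by k = theta * r
k_rot = d * np.tan(theta)    # RPP displacement under rotation, ~ theta * d
k_trans = k                  # RPP displacement under pure translation
print(k_rot / k_trans)       # ~ d / r = 10: rotation amplifies the error ~10x
```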

The use of a unit circle for the noise only provides a baseline for error estimation, which can in fact be adapted for specific applications. For pointing systems based on the estimation of the two ends of an elongated pointer, the idea of CICTs can easily be generalized and applied to the spatial supports, regardless of their shapes29, of the error distributions of the two points to estimate the EMER of the pointing position. Such supports can be obtained for a static pointer in each view by observing its two ends for some time.

5.4 Experiments

In order to clearly verify the validity of the EMERs with respect to actual error distributions, we focus on the static pointing situation in the experiments, i.e., we fix the pointer in space and measure the locus of RPPs. Thus, additional sources of interference, e.g., due to multi-camera synchronization and/or motion blur of a moving pointer, can be avoided. The error analysis results obtained here can be applied in the future to highly dynamic pointing situations if these interferences can be well controlled or even eliminated, e.g., via better imaging hardware. We first examine the proposed error estimation method by placing the pointer at a fixed position and pointing to a position on the projection plane. Then, pointing results obtained by selecting a pair of cameras for each RPP according to the EMERs are compared with those obtained by using all cameras.

Figs. 5.5(a) and (b) show an orange stick which is fixed in the workspace and is used in the

