
Institute of Computer Science and Engineering

People Localization Using Multiple Cameras

Student: Kuo-Hua Lo

Advisors: Dr. Jen-Hui Chuang

Dr. Hua-Tsung Chen



People Localization Using Multiple Cameras

Student: Kuo-Hua Lo

Advisors: Jen-Hui Chuang

Hua-Tsung Chen

National Chiao Tung University

Institute of Computer Science and Engineering

Doctoral Dissertation

A Thesis

Submitted to the Institute of Computer Science and Engineering,

College of Computer Science,

National Chiao Tung University,

in Partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in

Computer Science

February 1, 2013

Hsinchu, Taiwan, Republic of China


People Localization Using Multiple Cameras

Student: Kuo-Hua Lo

Advisors: Jen-Hui Chuang

Hua-Tsung Chen

Institute of Computer Science and Engineering, National Chiao Tung University

Abstract (in Chinese)

Occlusion is an important and challenging topic in vision-based people localization and tracking. To handle this problem, several methods for people localization using multiple cameras are proposed in this dissertation. Previously proposed methods check the existence of people on reference planes of different heights by projecting the foreground of multiple views onto these planes; compared with using only a single reference plane, they can handle occlusion effectively, but the amount of computation grows rapidly with the numbers of reference planes and camera views. To reduce the computation required by such projections, we propose a first method based on line sampling. Using the vanishing point of lines perpendicular to the ground in each image, 2D sample line segments of people are estimated, so that localization on reference planes of different heights only requires computing intersections of these line segments to reconstruct people locations, greatly reducing the computation of projecting all foreground onto multiple planes in previous approaches. The intersection points are then analyzed, and points on different planes are connected to form 3D sample line segments. After quality evaluation, with unsuitable axes discarded, these 3D sample line segments are divided into several groups by a clustering algorithm, and the location of each person is derived by integrating the 3D axes within the corresponding group.

Since the above method still requires considerable time for reconstruction, we propose a second, reconstruction-free people localization method to further improve efficiency. Instead of projecting all foreground onto multiple planes, it first roughly estimates potential people locations via footstep analysis and then generates 3D sample line segments to confirm where people are. This not only improves computation speed, but also allows people heights to be estimated in the process. In addition, we improve the first method and propose a third localization method with two main improvements: (1) a new reconstruction scheme based on the intersection of two vertical triangles, together with a refinement step, to find possible 3D sample line segments of a person, and (2) two new geometric filtering rules related to the head level for screening these 3D sample line segments. Both improve localization correctness in terms of precision and recall, while (2) also improves computational efficiency. Furthermore, we propose a view-invariant measure of line correspondence which can quantitatively evaluate the correspondence of arbitrary line segments in images from different views. We further apply it to the people localization method, improving efficiency without reducing localization correctness. Finally, we investigate the possibility of using the correspondence between sample line segments, as well as the angle between two views, to further reduce people localization error.


People Localization Using Multiple Cameras

Student: Kuo-Hua Lo

Advisors: Jen-Hui Chuang

Hua-Tsung Chen

Institute of Computer Science and Engineering

National Chiao Tung University

Abstract

Occlusion has been an important and challenging problem in vision-based people localization and tracking. To handle this problem, we propose several people localization methods in this thesis, all based on multiple cameras. Some existing methods check the existence of people on reference planes of different heights by projecting image foreground from multiple views onto these planes; such approaches can deal with occlusion better than using only a single reference plane. In order to reduce the amount of computation due to image projection, especially for a large number of reference planes and camera views, we first propose a sample line-based method. The method estimates 2D line samples, which originate from the vanishing point of lines perpendicular to the ground plane, for each person in different images, and projects these 2D line samples onto reference planes to reconstruct people locations, so that the computation of previous work can be greatly reduced. For the subsequent localization process, the resulting intersection points are analyzed and connected to form 3D line samples, which are then grouped and integrated to reconstruct the locations of people in the scene.

Because the above method still takes a lot of computation during the reconstruction of 3D line samples, we propose a second method which is not based on reconstruction by projecting all foreground pixels to multiple reference planes. In particular, a footstep analysis is developed to find potential people locations, and 3D line samples are then generated to identify people locations. This method results in significant improvement in computational efficiency, with people heights being estimated as a by-product. We also propose a third method to improve the performance of the first one with (i) a new reconstruction from the intersection of two vertical triangles, together with refinement procedures, for possible 3D (vertical) line samples of the human body, and (ii) the addition of two new geometric rules (associated with the head level of a person) for the screening of these samples. While (i) reconstructs a 3D line sample directly (and efficiently), both offer valuable improvements in the localization performance, in terms of precision and recall, with (ii) also saving some computation time spent on invalid samples. In addition, we propose a view-invariant correspondence measure of 2D line segments in two different views. Such a quantitative measure can handle line segments of arbitrary configuration in the 3D scene. By applying such a measure, the efficiency of people localization is further improved without sacrificing localization correctness. Finally, possibilities of using the correspondence of line samples and the difference between a pair of viewing angles to decrease the error of people localization are studied, with some promising results obtained.


Contents

Abstract (in Chinese) ... iii

Abstract ... v

Chapter 1 Introduction ... 1

1.1 Monocular approaches ... 1

1.2 Multi-camera approaches ... 2

1.3 Organization of the thesis ... 4

Chapter 2 Vanishing point-based line sampling for efficient people localization ... 6

2.1 Construction of major axes for non-occluded persons from a pair of views ... 6

2.1.1 Major axis estimation for a person in an image ... 6

2.1.2 Finding a 3D major axis of a person – two approaches ... 6

2.1.3 Extension of finding 3D major axes for non-occluded multiple persons from a pair of views ... 8

2.2 Construction of major axes for multiple persons with occlusion ... 9

2.2.1 Generating 3D line samples using vanishing points ... 10

2.2.2 Integration of 3D line samples to form 3D major axes ... 11

2.3 Experiments ... 12

2.4 Summary ... 14

Chapter 3 Acceleration of vanishing point-based line sampling scheme for people localization and height estimation via footstep analysis ... 15

3.1 Finding candidate people regions (blocks) ... 15

3.2 People localization and height estimation ... 17

3.3 Experiments ... 17

3.4 Summary ... 21

Chapter 4 Enhancement of line-based people localization ... 22

4.1 Efficient 3D line construction from intersection of two triangles ... 22

4.2 Refinement and verification of reconstructed 3D line samples ... 22

4.3 Early screening for line correspondence ... 24

4.3.1 A view-invariant measure of line correspondence ... 24

4.3.2 Applying the line correspondence measure to improve the efficiency of people localization ... 26

4.4 Experiments ... 28

4.4.1 Applying the improvements described in Sections 4.1 and 4.2 ... 28

4.4.2 Applying the improvements described in Section 4.3 ... 34

4.5 Summary ... 36


5.1 Motivation ... 37

5.2 An experimental pointing system ... 39

5.3 Error analysis ... 40

5.4 Experiments ... 43

5.5 Summary ... 48

Chapter 6 Conclusions and future works ... 51

Appendix A The derivation of multiple homographic matrices for planes of different heights ... 52

Appendix B Setting the parameters ... 53

Appendix C Two types of synergy maps ... 57

Appendix D The preprocessing step ... 58

Appendix E Reconstruction of pointing points by homographic transformations ... 59

Bibliography ... 60


List of Figures

Fig. 2.1. Detected foreground regions and the estimated axis. ... 7

Fig. 2.2. Finding intersection points of two axes on a reference plane. ... 7

Fig. 2.3. The axis samples of the person shown in Fig. 2.1, which are reconstructed for reference (horizontal) planes with 4 cm spacing and up to 176cm in height. ... 8

Fig. 2.4. Illustration of filtering out incorrect 3D MAs by using an extra view. ... 9

Fig. 2.5. An example of overlap foreground and the estimated axis. ... 9

Fig. 2.6. (a)-(d) 2D line samples in Views 1-4. (e) The unverified 3D line samples which survive Rules 1-2. (f) The results of filtering and grouping. ... 10

Fig. 2.7. Grouping and localization results. (a) Input frame 532. (b) Grouping sets. (c) Accumulated synergy map of all reference planes. ... 12

Fig. 2.8. Localization results for frame 475 and 540. ... 13

Fig. 2.9. Processing speed (in frames per second) of (a) our method. (b) The generation of accumulated synergy map from all reference planes. ... 13

Fig. 3.1. Schematic diagram of the proposed people localization framework. ... 16

Fig. 3.2. Finding candidate people blocks (CPBs) by two-layered grids. (a) Layer 1 grid. (b) Layer 2 grid. (c) Merging the two-layered grids. ... 16

Fig. 3.3. Building and refining 3D virtual rods. ... 17

Fig. 3.4. An instance of scenario S1, captured from four different viewing directions. ... 19

Fig. 3.5. Localization results for scenario S1. (a) Segmented foreground regions and 2D line samples for Fig. 3.4(b). (b) 3D major axes to represent different persons in the scene. (c) Localization results illustrated with bounding boxes. ... 19

Fig. 3.6. Localization results, similar to those shown in Fig. 3.5, for scenario S2. ... 19

Fig. 3.7. Localization results, similar to those shown in Fig. 3.5, for scenario S3. ... 19

Fig. 3.8. Results of height estimation for S1. ... 20

Fig. 3.9. Results of height estimation for S2. ... 20


Fig. 4.1. Illustrations of the simplified 3D reconstruction. ... 23

Fig. 4.2. Filtering results of input images shown in Figs. 2.6(a)-(d). (a) The unverified 3D line samples which survive Rules 1-3, (b) the refined line samples which survive Rules 1-4, (c) final line samples (see text). ... 24

Fig. 4.3. (a) Illustration of the basic idea of the proposed correspondence measure of two line features (samples). (b) Illustration of a general form of the view-invariant cross ratio. ... 25

Fig. 4.4. Procedure to determine whether two line samples are likely to represent the same person. ... 27

Fig. 4.5. Illustration of numerical values of the proposed line correspondence measure (see text). ... 27

Fig. 4.6. A failure example of the proposed method. (a)-(d) The localization results (illustrated with bounding boxes) of four views. (e)-(h) Corresponding foreground regions and 2D line samples. (i) 3D line samples to represent different persons in the scene. ... 30

Fig. 4.7. An example of miss detections and false alarms of S3. (a) Segmented foreground regions and 2D line samples. (b) 3D line samples to represent different persons in the scene. (c) The localization results illustrated with bounding boxes. Note that corresponding colors are used in (b) and (c) for different groups/bounding boxes after grouping. ... 30

Fig. 4.8. Localization results for scenario S4. ... 30

Fig. 4.9. Localization results for scenario S5. ... 30

Fig. 4.10. Results of using different line densities (pixel-spacings, see text) with four cameras. (a) Recall and precision. (b) Localization error. (c) Computation speed. ... 32

Fig. 4.11. A more challenging localization example for a busy street scene. (a)-(d) The localization results (illustrated with bounding boxes) of four views. (e)-(h) Corresponding foreground regions and 2D line samples. (i) 3D line samples to represent different persons in the scene. ... 35

Fig. 5.1. Configuration of the pointing system and the reconstruction of a pointing point. ... 38


Fig. 5.2. Noise circles (simulated points) for the pointer endpoints located in stereo images shown in Fig. 5.1, and their CICTs (see text). ... 39

Fig. 5.3. (a) RPPs for simulated points shown in Fig. 5.2. (b) Range of reconstruction errors (with error-free reconstruction shown by an "x"). ... 42

Fig. 5.4. Error range shown in Fig. 5.3(b) (red), similar range but obtained by using only 4 points (with 90° spacing) from each noise circle in Fig. 5.2 (blue), and error range based on internal common tangents (black, see text). ... 42

Fig. 5.5. (a) Left image. (b) Right image. (c) EMER and actual RPPs. ... 44

Fig. 5.6. (a) Layout of the synthesized room. (b) Pointing positions on the projection plane. ... 44

Fig. 5.7. Estimated maximal error ranges for different camera pairs: (a) C1&C2. (b) C2&C3. (c) C1&C3. (d) C2&C4. (e) C1&C4. (f) C3&C4. ... 47

Fig. 5.8. (a) Image captured by C1 when the pointer is pointing toward P2. (b) Image captured by C3 when the pointer is pointing toward P2. (c) Image captured by C2 when the pointer is pointing toward P8. (d) Image captured by C4 when the pointer is pointing toward P8. ... 48

Fig. 5.9. Estimated maximal error ranges for the pointer moved left 150cm for different camera pairs: (a) C1&C2. (b) C2&C3. (c) C1&C3. (d) C2&C4. (e) C1&C4. (f) C3&C4. ... 49

Fig. 5.10. (a) Image captured by C1 when the pointer is pointing toward P7. (b) Image captured by C2 when the pointer is pointing toward P7. ... 50

Fig. 5.11. Distribution of RPPs of the nine pointing positions for the pointer placed at (a) (250, 100, 350) and (b) (100, 100, 350). ... 50

Fig. A.1. Illustration of the calculation of a reference point on πr. ... 52

Fig. B.1. Results of using different values of Tlen. (a) Recall and precision. (b) Mean localization error. (c) Computation speed. ... 55

Fig. B.2. Results of using different values of Tfg. (a) Recall and precision. (b) Mean localization error. (c) Computation speed. ... 55

Fig. B.3. Results of using different values of Nplane. (a) Recall and precision. (b) Mean localization error. (c) Computation speed. ... 55


Fig. B.4. Results of using different values of Tc. (a) Recall and precision. (b) Mean localization error. (c) Computation speed. ... 55

Fig. B.5. Results of using different values of Nline. (a) Recall and precision. (b) Mean localization error. (c) Computation speed. ... 56

Fig. B.6. Results of using different values of Tlen for S1. (a) Recall and precision. (b) Mean localization error. (c) Computation speed. ... 56

Fig. B.7. Results of using different values of Tlen for S2. (a) Recall and precision. (b) Mean localization error. (c) Computation speed. ... 56

Fig. C.1. (a)-(d) Foreground likelihood maps. (e) The synergy map used in [25]. (f) The synergy map obtained by using binary foreground images. ... 57

Fig. D.1. (a) An input image. (b) The detected pointer and its bounding box. ... 58


List of Tables

Table 3.1. Performance of the proposed approach in this chapter. ... 19

Table 3.2. Performance of people localization of [26]. ... 20

Table 4.1. Localization results of sequences S1-S3. ... 29

Table 4.2. Localization results of sequences S4 and S5. ... 32

Table 4.3. Results of using different numbers of cameras. ... 32

Table 4.4. Filtering results of Fig. 4.4... 35

Table 4.5. Localization results of sequences S1-S3. ... 35

Table 4.6. Localization results of the method proposed in Sections 4.1, 4.2, and 4.3. ... 35

Table 5.1. Coordinates of the vertices shown in Fig. 5.4. ... 42

Table 5.2. Suggestion of camera pairs. ... 45

Table 5.3. Pointing errors of the two methods for the pointer placed at (250, 100, 350). ... 46

Table 5.4. Pointing errors of the two methods for the pointer placed at (100, 100, 350). ... 47

Table B.1. Recommended value ranges of parameters for S1-S3. ... 54


Chapter 1

Introduction

In recent years, visual surveillance using multiple cameras has attracted much attention in the computer vision community. Moreover, vision-based localization and tracking have shifted from monocular approaches to multi-camera approaches since the latter can often achieve better results. Especially when there are many people in the scene, serious occlusions may occur in multiple views and real-time people tracking and localization become a challenging problem. Thus, the previous works on visual surveillance are reviewed in the following in two categories: monocular approaches and multi-camera approaches.

1.1 Monocular approaches

In [1], [2], location and intensity of image foreground are extracted to allow construction of a human model, which can then be matched against a subject image for tracking in successive grayscale images. In [3], color information is used to construct human models, wherein a person is modeled by several parts of similar color, and a Bayesian framework is employed to handle occlusion in the tracking process. In [4], an extension of the particle filter using object contour is proposed to track the head of a person. In [5], a color-based tracking method, which integrates color distributions into a particle filter, is presented to describe people using ellipses and associated color histograms. The method is robust when dealing with partial occlusion, and is rotation and scale invariant. In [6], color, shape, and edge cues are integrated into a particle filter to create a robust tracking method. Additionally, the authors propose an adaptive scheme to choose the most effective cues in different situations. However, the performance of these methods may be seriously impaired when the human model of an occluded person is not updated in time, as the appearance of a person may change significantly. To resolve such a problem, spatial/temporal features are used in [7] to train convolutional neural networks for robust people tracking, wherein the appearances of a target object from different views are adopted in the training stage.

Since single-view tracking depends on inherently limited information from a single viewing angle, dealing with situations involving serious or full occlusions is quite difficult. Thus, many multi-view tracking approaches have been proposed. Unlike a single view, multiple views can provide more visual information to cope with occlusions in human localization. For example, a stereo camera with a small baseline can estimate depth information easily, whereas a set of wide-baseline cameras can decrease invisible regions. Finding feature correspondence is usually the most important step for many multi-camera approaches, since only correct correspondences between multiple cameras can ensure the correctness of subsequent processes, e.g., localization and tracking.

1.2 Multi-camera approaches

There are several types of multi-camera approaches for tracking people. The first type uses a stereo camera to obtain depth maps for tracking. The second type can be divided further into two sub-categories, region-based and point-based methods, both of which have to establish correspondence between different views for tracking. The third type seeks to find locations of persons directly, without the correspondences of people in different views.

For the first type of approaches, such as [8–10], a stereo camera is exploited to establish correspondence between two views to construct a depth map. By using such a map to avoid the influence of moving shadows on foreground detection, better segmentation results can be obtained and object tracking becomes more robust. However, using a pair of cameras with a small baseline may suffer from frequent total occlusions. Without information about occluded regions (e.g., behind a person closer to the stereo camera), the tracking performance is impaired.

Region-based methods of the second type generally regard people as regions and use region features to match people in multiple views. Most of these methods use color as the main feature to find correspondences of regions in different views. For instance, color and 3D position are utilized to match and track multiple objects by a tracking algorithm in [11]. In [12], the authors use Gaussian color models to segment foreground regions of people from each image. The results are then used to match regions from one view to another along epipolar lines to find correspondence across multiple views. After that, Kalman filters are used to track people on the ground plane. In [13], the authors use Bayesian networks for object tracking in individual views independently. After that, both geometry-based cues (epipolar geometry, homographies, and landmarks) and recognition-based cues (height and color of target appearance) are utilized to find correspondence across multiple views. However, one of the main disadvantages of these methods is that color information may degrade the tracking performance, since appearance and color can change with scene illumination.

Point-based methods can be further divided into two additional sub-categories: 3D-based and 2D-based methods. 3D-based methods locate and find correspondences of target objects in images based on 3D geometric constraints, and often need a complete camera calibration. In [14], the location of a person is described by a Gaussian distribution of its center of gravity (COG) in the scene. The distribution, which denotes the probability of the existence of a COG point, is projected onto multiple views, respectively, and the correspondence of feature points can be found by maximizing the probability of the COG distribution in each view. In [15], people are modeled as vertical cylinders and tracked by optical flow. During the tracking process, the COG of the human body in multiple views is used to estimate people locations in the world coordinate system. In [16], cameras are calibrated for the calculation of 3D positions of feet points of target people, and the correspondences can be established from these feet points. In [17], feature points are extracted from a (vertical) major line of the upper part of a human body. The correspondence of the human body is found by matching intensity and location through epipolar constraints. However, the extracted feature points from each view may not always correspond to the same point in the 3D space; in that case, the matching performance, the established correspondence, and the tracking results may be impaired.

Different from the above 3D-based methods, some 2D-based methods have been presented to establish correspondences between multiple cameras by matching locations of feature points on a reference plane. In [18–20], the homography constraint is used to match the locations of feet points in different views. However, these feature points may be occluded between objects. In [20], a method which can detect whether the feet points of a person are occluded is proposed to select a best view for each person appearing in the scene. In contrast, the authors of [21] propose a method using the axes of people to estimate the feet points in images. They segment a group of people into individual persons and estimate an axis for each of them. Then, the location of the feet point of a person is estimated as the intersection point of his/her axis and the bottom of his/her bounding box. In [22], the foreground of a person is perspectively projected from each view to the ground plane, with the corresponding camera being the projection center. For each camera, a line passing through (i) the projected foreground and (ii) the vertically projected camera center, both on the ground plane, is estimated. The person's location can then be estimated by calculating the intersection of these estimated lines on the ground plane based on the least square criterion. For most of the aforementioned point-based approaches, accurate detection/estimation of point/line features, and of their correspondences in different views, is required; otherwise the correctness of a person's location will be seriously impaired.

In recent years, approaches of the third type have been proposed. These methods, which do not need a complete camera calibration, can locate people directly without finding the correspondences of the people between views. In [23-24], the authors propose a method using cameras placed at high elevation to detect the heads of people. The method assumes the cameras are partially calibrated for homographic matrices for multiple planes of different heights. For each plane, intensity information of segmented foreground pixels is collected from all views, and head detection is achieved through intensity correlation. In [25], the authors propose an interesting method to track people by locating them on similar reference planes. The foreground likelihood information of all image pixels captured from different views is projected and integrated on each reference plane to form an occupancy probability. Such probabilities from several frames are then processed by a graph cut algorithm to find trajectories of people. Although the correspondences of people between different views are not available1, such an approach performs quite well under serious occlusions in a crowded scene. Due to the high complexity of pixel-based processing, the approach is implemented with CUDA (Nvidia GeForce 7300 GPU) to achieve real-time performance.

Unlike the above methods, which need to project all foreground pixels of all views to multiple reference planes via homography, we propose three efficient and effective people localization methods. The first one applies vanishing point-based line sampling to reduce the large amount of pixel processing so that computational efficiency can be greatly enhanced. The second one further improves the efficiency and robustness of the first one by adopting a more accurate 3D reconstruction process, more effective geometric filtering rules, and a novel measure of line correspondence. Instead of 3D reconstruction, the third method uses a coarse-to-fine strategy to find people locations by 3D line sampling. Finally, error analysis is considered for further improvement of the accuracy of people localization for the second method.

1.3 Organization of the thesis

The remainder of this thesis is organized as follows. In Chapter 2, people localization via vanishing points of vertical lines and multiple homographic matrices is proposed. The vanishing points are used to generate 2D line samples of foreground regions in multiple views. Potential people locations are found by projecting each pair of 2D line samples from different views to the reference planes of different heights via homographic matrices. The intersection points are then connected to form 3D line samples. After that, the 3D line samples are checked against foreground regions of all views and grouped to locate people. Instead of reconstruction in the 3D space, we propose a grid-based approach to efficiently find potential people locations on the ground in Chapter 3. We then generate 3D sample lines for these potential people locations, refine their two ends, and remove those not covered by enough foreground pixels in all views. Additionally, people heights are estimated from the 3D line samples as by-products. In Chapter 4, a more efficient reconstruction method is proposed to improve the people localization approach described in Chapter 2, where the reconstruction of 3D line samples takes a lot of computation time to project 2D line samples to multiple reference planes; the new approach reconstructs a 3D line sample as the intersection of two vertical triangles. In addition, a pre-filtering procedure using a view-invariant measure of line correspondence is also introduced to further improve the efficiency. In Chapter 5, we first review an error analysis method for a pointing system. The idea is then extended and applied to our people localization method described in Chapter 3 to increase the accuracy of localization. Chapter 6 summarizes this thesis.

1 For example, no additional image processing procedures are performed to identify each individual from a crowd, e.g., through connected component analysis.


Chapter 2

Vanishing point-based line sampling for efficient people localization

In this chapter, vanishing point-based line sampling is introduced to increase the computation speed of people localization. In each image captured from a different viewing angle, the vanishing point of vertical lines in the scene is used to generate 2D line samples of foreground regions. Subsequently, 3D line samples of persons can be found efficiently via 3D reconstruction from stereo 2D line sample pairs, avoiding the pixel-based operations suggested in [23-25].

2.1 Construction of major axes for non-occluded persons from a pair of views

For a better understanding of the basic ideas of the proposed localization, we begin by illustrating how to localize people using the major axes (MAs) of the foreground regions in 2D images. Assume the foregrounds of different persons do not overlap in a pair of views, in which the major axis of each of them can be estimated correctly. By projecting these axes, instead of projecting all foreground pixels as in [25], onto multiple reference planes parallel to the ground plane, a 3D axis can be formed for each person by connecting corresponding intersection points of the projected 2D axes on these reference planes. Furthermore, a more efficient scheme is introduced to find the above 3D axis by calculating the intersection line segment of two triangles in the 3D space if the camera centers can be estimated in advance.

2.1.1 Major axis estimation for a person in an image

In order to segment foreground regions of a person from an image, the Gaussian mixture model (GMM) [27], [28] can be applied. Assuming that region R obtained from foreground segmentation contains a great percentage of a person, we can estimate the major axis of the person by PCA. An example of an axis thus estimated is shown in Fig. 2.1. One can see that the estimated major axis represents the elongated shape of a person very well.
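As a minimal sketch of this step, assuming a binary foreground mask from GMM segmentation is already available (function names and data layout are illustrative, not from the thesis):

```python
import numpy as np

def major_axis(foreground_mask):
    """Estimate the 2D major axis of a foreground region by PCA.

    foreground_mask: 2D boolean array, True at foreground pixels.
    Returns (centroid, direction, half_length) of the axis.
    """
    ys, xs = np.nonzero(foreground_mask)
    pts = np.stack([xs, ys], axis=1).astype(float)   # (N, 2) pixel coordinates
    centroid = pts.mean(axis=0)
    # Principal direction = eigenvector of the covariance with largest eigenvalue.
    cov = np.cov((pts - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    direction = eigvecs[:, np.argmax(eigvals)]       # unit vector along the body
    # Extent of the region along the principal direction.
    proj = (pts - centroid) @ direction
    half_length = max(abs(proj.min()), abs(proj.max()))
    return centroid, direction, half_length
```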

2.1.2 Finding a 3D major axis of a person – two approaches

As shown in Fig. 2.2, let L1 and L2 be the axes of a person obtained by PCA for View 1 and View 2, respectively. In addition, let P12 be the intersection point of the two lines containing the projections of L1 and L2, respectively, onto reference (ground) plane π from camera centers C1 and C2. Ideally, for reference planes of different heights, such intersection points will either (i) belong to both the projected axes, or (ii) stay away from any of them if the corresponding heights are out of the range of the 3D axis. Fig. 2.3 shows samples of the 3D axis thus obtained for the person shown in Fig. 2.1. While intersection points satisfying (i) are colored in black, points not satisfying (i), including those contained in one but not both projected axes due to computation errors, are marked in red2.

Fig. 2.1. Detected foreground regions and the estimated axis.

Fig. 2.2. Finding intersection points of two axes on a reference plane.

The above results provide an important cue to the estimation of a person's height. Additionally, one can see that the 2D (horizontal) positions of these 3D points are consistent enough that a roughly vertical major axis (MA) of the person can be constructed by connecting the black points, i.e.,

Axis\_set = \{P_{1,2}^{h_b}, \ldots, P_{1,2}^{h_t}\} \quad (2.1)

with h_b and h_t being the heights of the bottom and top end points of the axis, respectively.

2 To find the above intersection points on reference planes of different heights, a method to produce multiple homographic matrices is introduced which can establish these matrices using only two marker points on each of the four calibrating pillars standing vertically on the ground plane. The details can be found in Appendix A.


Fig. 2.3. The axis samples of the person shown in Fig. 2.1, which are reconstructed for reference (horizontal) planes with 4 cm spacing and up to 176cm in height.

2.1.3 Extension of finding 3D major axes for non-occluded multiple persons from a pair of views

The above method can be extended to estimate 3D MAs for multiple people if an axis can be found for each of them in two different views. Without knowing the correspondence of the axes in the two views, candidate 3D MAs can be constructed for all possible 2D MA pairs. For example, for M persons in View 1 and N persons in View 2, a total of M×N candidate MAs can be constructed (minus those associated with triangle pairs which do not intersect, like the two blue triangles shown in Fig. 2.4).

For a candidate 3D MA obtained for person i in View 1 and person j in View 2, (2.1) can be rewritten as

Axis\_set(i, j) = \{P_{1,2}^{h_b}(i, j), \ldots, P_{1,2}^{h_t}(i, j)\} \quad (2.2)

Although we do not have correspondences of different people in these two views, it is possible to remove incorrect 3D MAs by checking the consistency in the foreground coverage, as will be explained in Subsection 2.2.1, with additional views. For example, while the two green axes in Fig. 2.4 are correct 3D MAs, the gray axis can be identified as an invalid axis from View 33.

3 In general, incorrect MAs constructed from a pair of triangles can be removed by checking the consistency with an additional view point (in the 3D space), except for those view points which are coplanar (in a 2D subspace) with one of the two triangles mentioned above. Therefore, with the help of an additional camera, incorrect MAs will be removed completely, with zero probability for the above exceptions.


Fig. 2.4. Illustration of filtering out incorrect 3D MAs by using an extra view.

Fig. 2.5. An example of overlapping foreground and the estimated axis.

2.2 Construction of major axes for multiple persons with occlusion

The above 2D PCA-based axis estimation can only cope with situations under which the foreground of a person is separable from others' in all views, and can be identified as one region by connected component analysis. However, in real applications, many people may appear in a monitored scene at the same time, so that a segmented foreground area may contain more than one person, as shown in Fig. 2.5, and the aforementioned axis detection approach will not work correctly. One possible solution, proposed in [21], is to separate persons by projecting the foreground in the vertical direction to form a histogram, and then determining the boundaries between persons based on the locations of peaks and valleys in the histogram, before each person can be represented by one axis for localization and tracking. However, the above approach may not work well when a very dense group of people appears in the scene, e.g., for the case shown in Fig. 2.6. For such more complicated situations, instead of estimating a 2D axis for each person, a 3D sampling scheme is proposed in this section wherein 2D line samples of the foreground regions from multiple views are used to generate 3D line samples of the foreground "volume" based on the same idea described in Section 2.1. Then, with noise filtered out, these 3D line samples are verified with respect to different views by a back-projection procedure. Finally, a grouping algorithm is applied to the remaining samples in the scene, before members of each group are integrated into a 3D MA.

Fig. 2.6. (a)-(d) 2D line samples in Views 1-4. (e) The unverified 3D line samples which survive Rules 1-2. (f) The results of filtering and grouping.

2.2.1 Generating 3D line samples using vanishing points

Since the upper bodies of people are almost always perpendicular to the ground plane when they are standing and walking in a monitored scene, we first generate 2D line samples in each view which originate from the vanishing point of vertical lines in the 3D scene (see Figs. 2.6(a)-(d))4. Thus, these 2D line samples correspond to a fan of vertical sampling slices in the 3D space originating from the vertical line containing the corresponding camera center. Note that generating 2D line samples is much faster than the axis estimation discussed in Section 2.1 since no additional image processing is required. The 2D sampling lines having very short lengths (less than a threshold Tp) will be discarded since they are expected to be far away from a major axis and will have little contribution to the estimation of a 3D MA.

4 The vanishing point in each view can be estimated by calculating the intersection points of the four lines extended from the four upright pillars.

Next, for each pair of views, the remaining 2D line samples are used to reconstruct 3D line samples by the scheme described in Section 2.1. Since there may still be incorrect 3D line samples, such as the gray one shown in Fig. 2.4, two geometric rules can be used to filter out the 3D line samples that will not correctly represent a person in the 3D scene:

1) The length of a 3D line sample is shorter than Tlen,

2) The height of its bottom end point Phb is higher than Tb.

Fig. 2.6(e) shows the 3D line samples that passed the two rules, each adjusted slightly so that it is perpendicular to the ground plane.

After using the above two filtering rules, we further verify the 3D line samples against the image foreground. To check the foreground coverage of a 3D line sample, we back-project its intersection points of different heights to all image views. For a person who does appear in the monitored scene, these back-projected points should be covered by some foreground regions. For example, if all back-projected points in all views for a 3D MA are foreground, its average foreground coverage rate (AFCR) is equal to 100%. A 3D line sample with an AFCR lower than Tfg will be removed. Fig. 2.6(f) illustrates the filtering results for the line samples shown in Fig. 2.6(e).

2.2.2 Integration of 3D line samples to form 3D major axes

After the above verification procedure, the major axis of a person can be estimated from the remaining 3D line samples using a straightforward grouping algorithm5. Specifically, if the 2D horizontal distance between two 3D line samples is smaller than a threshold Tc, an edge is established in an undirected graph. After that, we can easily find connected components (3D line sample groups) in the graph. For example, Fig. 2.7(a) shows the input frame for Fig. 2.6(d), and Fig. 2.7(b) shows the undirected graph obtained by the above grouping algorithm, with green points representing the 3D line samples. To avoid false positives in the grouping, a group containing fewer than Nline 3D line samples will be removed.

To locate individual persons, the horizontal position of each of them can be estimated as the average, shown as red stars in Fig. 2.7(b), of the horizontal positions of the 3D line samples in the corresponding group6. In Fig. 2.7(c) we show the synergy map obtained with a method modified from [25]. Instead of considering the foreground probability of all image pixels, only those inside foreground regions are taken into account. One can see that the above distribution of each group matches the corresponding occupied region (red color) in the map quite well, i.e., all red stars do fall inside the occupied regions.

5 Details can be found in [46].

6 The heights of the top and bottom ends of a 3D major axis are assigned as the heights of the highest and lowest end points in the corresponding group.

Fig. 2.7. Grouping and localization results. (a) Input frame 532. (b) Grouping sets. (c) Accumulated synergy map of all reference planes.

2.3 Experiments

In order to evaluate our method, we used an indoor video with a resolution of 320 × 240. The spacing between 51 adjacent reference planes was selected as 4cm. In the video, six people are walking along three edges of the tiles on the ground, so we can easily evaluate the performance of localization. In Figs. 2.8(a) and (b), bounding boxes with a fixed cross-section of 50cm × 50cm are back-projected to individual images with their height obtained from the derived 3D MAs, shown on the right of the figures with bold lines. One can see that the six persons are well represented by these bounding boxes, and their locations match the specified tracks well. For a comparison of computation time with [25], simulation is performed with an implementation based on the C language on Windows 7 with 4 GB RAM and a 2.4G Intel Core2 Duo CPU. Fig. 2.9(a) shows the processing speed, in frames per second (FPS), of our method for different portions of the video, with intervals A to F corresponding to an increase from 1 to 6 persons in the scene, respectively. One can see that the processing speed varies with the people count, and more than 2.790 FPS can be achieved when there are six people in the scene; the average is 5.365 FPS. Fig. 2.9(b) shows the FPS for the generation of synergy maps, as proposed in [25], which varies much less with time and has an average value of 0.118 FPS (note that the CUDA implementation adopted in [25] is not used here). This is because its time complexity mainly depends on the size of the whole image, not just the foreground.

Fig. 2.8. Localization results for frames 475 and 540.

Fig. 2.9. Processing speed (in frames per second) of (a) our method. (b) The generation of accumulated synergy map from all reference planes.

2.4 Summary

We proposed a method for people localization which obtains 2D line samples of foreground regions in each view, with each line originating from the vanishing point of vertical lines in the scene. Geometrically, a pair of line samples obtained from two different views corresponds to a vertical line in the scene. 3D point samples along such a vertical line can then be obtained by projecting the above 2D line samples and identifying their intersection points on reference planes of different heights, using homographic matrices, each associating an image with a reference plane. Finally, the 3D MA of each person is estimated by grouping 3D line segments derived from point samples satisfying some location and shape constraints. Since the most time-consuming process, homographic projection, is performed for line samples instead of the whole image, the proposed approach can achieve near real-time performance with localization accuracy similar to that in [25].


Chapter 3

Acceleration of vanishing point-based line sampling scheme for people localization and height estimation via footstep analysis

In this chapter, the efficiency of the above line sample-based approach is further improved by considering only one reference (ground) plane and, without performing 3D reconstruction, adopting a 3D line sampling scheme. Fig. 3.1 illustrates the schematic diagram of the proposed framework. First, the preprocessing procedures of camera calibration and foreground segmentation are executed. Next, we generate lines originating from the vanishing point of vertical lines in the scene to sample the foreground objects (people) in each camera view, as in [26]. The line samples of foreground objects from all camera views are then projected onto the ground plane via homography, with regions crossed by a large number of projected sample lines identified as candidate people regions. We then generate (vertical) 3D sample lines for these candidate people regions, refine their two ends, and remove those not covered by enough foreground pixels in all views. Finally, the remaining 3D sample lines are grouped into individual axes to indicate people locations. Additionally, the height of each person can be estimated as a by-product.

3.1 Finding candidate people regions (blocks)

According to Fig. 3.1, we first generate 2D sample lines, originating from the vanishing point, of foreground regions in each camera view. The sample lines containing very few foreground pixels are discarded since they contribute little to the following localization process. Then, the remaining sample lines are projected onto the ground plane via homography. It is easy to see that the more a region is crossed by the projected sample lines, the more likely the region contains a person. Thus, we discretize the ground plane into a grid of 50cm × 50cm blocks, each having about the area a standing person occupies, and count the number of crossing sample lines for each block.

However, the above line counts may be distributed across neighboring blocks, as shown in Fig. 3.2(a). Thus, we add a second grid, which has an offset of 25cm in both the X and Y directions (on the ground plane) from the first one. Note that the second grid can have higher counts in some cells for the above example, as shown in Fig. 3.2(b). After merging the two layers of grids, we retain the higher count for each quarter block, as illustrated in Fig. 3.2(c); a code sketch of this two-layer vote is given after Fig. 3.2. Finally, the quarter blocks whose counts are greater than a threshold Tcn7 are identified as candidate people blocks (CPBs).

7 We set Tcn

Fig. 3.1. Schematic diagram of the proposed people localization framework.


Fig. 3.2. Finding candidate people blocks (CPBs) by two-layered grids. (a) Layer 1 grid. (b) Layer 2 grid. (c) Merging the two-layered grids.
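A rough sketch of the two-layer vote referenced above, assuming each projected sample line is given by its two end points on the ground plane in cm (all names here are illustrative, not from the thesis):

```python
import numpy as np

BLOCK = 50.0    # cm, grid cell size (about the area a standing person occupies)
OFFSET = 25.0   # cm, shift of the second grid layer in both X and Y

def block_votes(projected_lines, offset=0.0, step=5.0):
    """Count, per grid block, how many projected sample lines cross it.

    projected_lines: list of (p, q) ground-plane segments, p and q in cm.
    Points are sampled every `step` cm along each segment, and each
    line votes for a given block at most once.
    """
    votes = {}
    for p, q in projected_lines:
        p, q = np.asarray(p, float), np.asarray(q, float)
        n = max(2, int(np.linalg.norm(q - p) / step) + 1)
        cells = set()
        for t in np.linspace(0.0, 1.0, n):
            x, y = p + t * (q - p)
            cells.add((int((x - offset) // BLOCK), int((y - offset) // BLOCK)))
        for c in cells:
            votes[c] = votes.get(c, 0) + 1
    return votes

# Layer 1 uses offset 0 and layer 2 a 25cm offset; after merging, each
# quarter block keeps the higher of its two counts, and quarter blocks
# with counts above T_cn become candidate people blocks (CPBs).
```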

(31)

17

Fig. 3.3. Building and refining 3D virtual rods.

3.2 People localization and height estimation

In this section, to achieve the goal of people localization and height estimation, vertical line samples of the human body are generated for the above CPBs. These line samples are then refined with respect to the image foreground from different views, screened by some physical properties of the human body, and grouped into axes of individual persons. In particular, four equally-spaced rods of 200cm in height are established on each CPB, as shown in Fig. 3.3. For each rod, we back-project it onto each camera view, and inwardly refine its top and bottom (C and D in Fig. 3.3, as well as C′ and D′ calculated using the view-invariant cross ratio) until they are covered by a foreground region. For error tolerance, e.g., to cope with noise and occlusion, the intersection of all the refined 3D rods for each ground location from different camera views is adopted as the final line sample of a possible human body.
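A sketch of this inward refinement for one rod, with `top` and `bottom` as 3D numpy arrays and the same hypothetical helpers as in the Chapter 2 sketch (`to_image(v, X)` back-projects a 3D point into view v; `fg[v]` is a binary foreground mask):

```python
import numpy as np

def refine_rod(top, bottom, views, to_image, fg, n_samples=50):
    """Shrink a vertical 3D rod from both ends until the remaining part
    is covered by foreground in every view; returns (new_top, new_bottom),
    or None if no sample point is covered at all."""
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = [top + t * (bottom - top) for t in ts]

    def covered(X):
        for v in views:
            u = to_image(v, X)
            x, y = int(round(u[0])), int(round(u[1]))
            h, w = fg[v].shape
            if not (0 <= x < w and 0 <= y < h and fg[v][y, x]):
                return False
        return True

    flags = [covered(X) for X in pts]
    if not any(flags):
        return None
    i = flags.index(True)                            # topmost covered sample
    j = len(flags) - 1 - flags[::-1].index(True)     # bottommost covered sample
    return pts[i], pts[j]
```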

Based on the physical shape/size of a human body, we then apply the rules described in Subsection 2.2.1 to filter out incorrect 3D line samples obtained above. The grouping procedure described in Subsection 2.2.2 is also applied. Finally, for each group, the average location (maximum height) of the line samples is regarded as a person's location (height).

3.3 Experiments

To evaluate our methods under different degrees of occlusion, we captured several video sequences of indoor and outdoor scenes. For each scene, calibration pillars are placed vertically and then removed from the scene for the estimation of camera centers, vanishing points, and multiple homographic matrices (see Appendix A). These sequences are captured with different numbers and trajectories of people. The computation is performed with a PC under Windows 7 with 4 GB RAM and a 2.4G Intel Core2 Duo CPU, without using any additional hardware.

Fig. 3.4 shows an instance of scenario S1 captured from four different viewing directions with a 360×240 image resolution. The average distance between the cameras and the monitored area is about 15m. One can see that the lighting conditions are quite complicated: the sunlight may come through the windows directly, and reflections from the floor can be seen clearly. A total of 691 frames are captured for S1, wherein eight persons are walking around a ninth one standing near the center of the monitored area.

Figs. 3.5(a) and (b) show the 2D line samples generated for Fig. 3.4(b) and the reconstructed 3D MAs, viewed from a slightly higher elevation angle, respectively. In addition, for a closer examination of the correctness of the proposed people localization and height estimation scheme, bounding boxes with a fixed cross-section, and with their height obtained from the derived 3D MAs, are back-projected onto the captured images, as shown in Fig. 3.5(c) for the image shown in Fig. 3.4(b). One can see that these bounding boxes overlay nicely with the corresponding individuals. The recall and precision rates for the whole sequence are evaluated as 96.3% and 95.9%, respectively.

Fig. 3.6 shows similar localization results for scenario S2, which has the same people count as S1, but the nine people are walking randomly in the scene, so the occlusion among them becomes more serious. As a result, both the recall and precision rates decrease slightly. To further examine the robustness of our method under serious occlusion, scenario S3 is evaluated, which is similar to S2 but has twelve persons randomly walking in the scene. Since the scene becomes more crowded and serious occlusion occurs more frequently, the foregrounds of different persons may easily merge into larger regions, as shown in Fig. 3.7(a). While satisfactory localization results are obtained in Figs. 3.7(b) and (c), the recall and precision rates for S3 decrease to 91.9% and 90.0%, respectively.

The performance of the people localization approach described in this chapter is presented in Table 3.1. The precision and recall rates in all three scenes are above 90%. Furthermore, the proposed approach achieves very high computational efficiency, even for the crowded scene S3, wherein 12 persons can be located quite accurately at a high processing speed of about 100 fps. For performance comparison, similar results of people localization obtained in [26] are listed in Table 3.2. One can see that the approach proposed in this chapter achieves precision and recall rates similar to those in [26]. However, the processing speed is enhanced (about 2.6 times faster than [26]) due to the use of 3D line samples, instead of reconstructing 3D major axes via computing pairwise intersections of sample lines of image foreground projected at different heights.


Fig. 3.4. An instance of scenario S1, captured from four different viewing directions.


Fig. 3.5. Localization results for scenario S1. (a) Segmented foreground regions and 2D line samples for Fig. 3.4(b). (b) 3D major axes to represent different persons in the scene. (c) Localization results illustrated with bounding boxes.

Fig. 3.6. Localization results, similar to those shown in Fig. 3.5, for scenario S2.

Fig. 3.7. Localization results, similar to those shown in Fig. 3.5, for scenario S3.

Table 3.1. Performance of the proposed approach in this chapter.

Sequence Recall Precision Avg. error FPS

S1 96.3% 95.9% 12.16cm 30.74(0.47)

S2 95.2% 95.3% 10.94cm 32.06(0.52)


Table 3.2. Performance of people localization of [26].

Sequence Recall Precision Avg. error FPS

S1 92.0% 95.7% 11.60cm 11.62(1.008)

S2 94.9% 97.3% 10.00 cm 12.05(1.201)

S3 93.3% 94.3% 10.28 cm 8.34(1.025)

Fig. 3.8. Results of height estimation for S1.

Fig. 3.9. Results of height estimation for S2.


The results of person height estimation for S1 are presented in Fig. 3.8, where red squares indicate the actual heights and blue dots represent the estimated heights, together with intervals of one standard deviation. One can see that the errors are less than 5cm. Similar estimation results for S2 can be observed in Fig. 3.9. However, in Fig. 3.10, the height estimate of one person (P6) shows an error of more than 10cm, which may result from more serious occlusion.

3.4 Summary

We propose an efficient and effective approach for people localization using multiple cameras. Enhanced from [26], we retain the advantage of vanishing point-based line sampling, and develop a 3D line sampling scheme to estimate people locations, instead of reconstructing 3D major axes via computing pairwise intersections of the sample lines at different heights as in [26]. The computation cost is thus greatly reduced. In addition, an effective height estimation scheme is also proposed in this chapter. The experiments on crowded scenes with serious occlusions also verify the effectiveness and efficiency of the proposed approaches.


Chapter 4

Enhancement of line-based people localization

In this chapter, enhancement of the people localization approach described in Chapter 2 (see also [26]) is considered. The three major improvements include (i) more efficient 3D reconstruction, (ii) more effective filtering of reconstructed 3D line samples, and (iii) the introduction of a view-invariant measure of line correspondence for early screening. While (i) and (ii) are direct improvements of the approach presented in Chapter 2, (iii) introduces a new way of measuring the correspondence of two line samples obtained in different views.

4.1 Efficient 3D line construction from intersection of two triangles

While the approach described in Chapter 2 takes a lot of computation time to calculate intersection points on multiple reference planes, as shown in Fig. 4.1 (left), an equivalent reconstruction of the 3D axis can actually be obtained by intersecting the two triangles8, as shown in Fig. 4.1 (right). By adopting such a method, the computation time, which no longer depends on the number of intersection points (reference planes), is expected to decrease greatly. Axis points can then be estimated by directly sampling along the 3D axis if necessary.
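A geometric sketch of this simplified reconstruction, under the assumption that each view's vertical triangle (camera center plus the two ground projections of the 2D line sample's end points) is known as three 3D points; the 3D line is then the intersection of the two supporting planes:

```python
import numpy as np

def plane_from_points(a, b, c):
    """Supporting plane of triangle abc, as (unit normal n, offset d)
    with n.dot(x) + d = 0 for points x on the plane."""
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)
    return n, -n.dot(a)

def intersect_planes(plane1, plane2, eps=1e-9):
    """Line of intersection of two planes, as (point, unit direction)."""
    (n1, d1), (n2, d2) = plane1, plane2
    direction = np.cross(n1, n2)
    if np.linalg.norm(direction) < eps:
        return None                      # (near-)parallel planes, no line
    # Solve n1.x = -d1, n2.x = -d2, direction.x = 0 for a point on the line.
    A = np.stack([n1, n2, direction])
    x0 = np.linalg.solve(A, np.array([-d1, -d2, 0.0]))
    return x0, direction / np.linalg.norm(direction)

# The returned infinite line is finally clipped to the part common to both
# triangles to obtain the 3D line sample.
```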

4.2 Refinement and verification of reconstructed 3D line samples

Although the rules of geometric filtering adopted in Chapter 2 are low-cost and effective, more filtering rules may be included to reduce miss detections. Since the two ends of a 3D line sample reconstructed above may be inaccurate, e.g., due to noise, we propose a refinement procedure to improve their precision. Additionally, two new rules are added, one before and the other after the refinement procedure, to increase the computation speed. Thus, the entire filtering procedure becomes more precise and effective. In particular, the following new rule, together with Rules 1-2, will be applied to a line sample right after the 3D reconstruction:

3) The height of its top end point Pht is lower than Ttl.

Fig. 4.2(a) shows line samples which survive Rules 1-3.

Fig. 4.1. Illustrations of the simplified 3D reconstruction.

The main objective of the above three rules is to preserve two kinds of 3D line samples, which correspond to (i) the full length of a standing/walking person or (ii) the head and torso of a person without his/her feet. By selecting appropriate thresholds, these three rules may also accommodate human activities such as jumping and squatting. In practice, these three rules can efficiently remove most inappropriate 3D line samples, e.g., 84% of the originally reconstructed 3D line samples for the above example. However, since each 3D line sample is reconstructed from observations of two views only, the top and bottom ends of each 3D line sample may not be very accurate in position. To deal with such a problem, a refinement procedure using information from additional views, as described next, is adopted to find more accurate positions of the two end points before further verification of the 3D line samples is performed.

Conceptually, the refinement scheme is based on the fact that if a 3D line sample corresponds to a real person in the scene, its image in all views should be covered by foreground regions. In other words, its top and bottom end points will be covered by some foreground regions in all views. If that is not the case, the 3D line sample should be shortened until it falls within foreground regions in all views. Specifically, for each 3D line sample, we can use equally spaced sample points between its two ends Pht and Phb to form axis samples {Pht, …, Phb}9 (see (2.1) in Subsection 2.1.3). The refinement of the top end point corresponds to finding the first sample point below Pht that is covered by some foreground regions in all views. Similarly, the refinement of the bottom end point can be done by searching in the upward direction from Phb.

After such a refinement (shrinking) procedure, Rules 1-3 can be applied again, together with another new rule,

4) The height of the top end point Pht is higher than Tth,

to filter out inappropriate 3D line samples. One can see from Fig. 4.2(b) that rough people locations can be distinguished visually from the remaining 3D line samples. Finally, a threshold Tfg is used to filter out 3D line samples which do not have a sufficient average foreground coverage rate (AFCR), as shown in Fig. 4.2(c)10.

9 The interpolation spacing between two adjacent sample points corresponds to a total number of Nplane equally spaced reference planes between the ground plane and the plane with 250cm in height.

10 In our implementation, each sample point of a 3D line sample is projected to all views to check if it is covered by foreground for the computation of the AFCR.

Fig. 4.2. Filtering results of input images shown in Figs. 2.6(a)-(d). (a) The unverified 3D line samples which survive Rules 1-3, (b) the refined line samples which survive Rules 1-4, (c) final line samples (see text).

4.3 Early screening for line correspondence

In this section, we propose a line correspondence measure for 2D line segments in two different views, based on a formulation of the cross ratio. Such a quantitative measure is view-invariant and can handle line segments of arbitrary configuration in the 3D scene; it will be applied to the people localization method described in Section 4.1 to filter out non-corresponding line sample pairs before 3D reconstruction. Therefore, the computation speed of the proposed people localization can be further improved. We also convert the formulation to a more efficient form for computational efficiency. While such a measure is first illustrated via the concept of 3D reconstruction, as shown in Fig. 4.3(a), for a better understanding of the basic idea, we will show that the measure can actually be computed in either one of the two views.

4.3.1 A view-invariant measure of line correspondence

Assume we have a pair of line samples L1 and L2 in View 1 and View 2, respectively, and that the homographic matrices H1π and H2π between the two views and the ground plane π can be obtained from camera calibration. By projecting the line samples onto plane π, points A, B, C, and D can be obtained along a line in 3D space reconstructed by intersecting two planes, each containing a camera center and the corresponding projected line sample. The lengths of AB and CD should be very small if the two line samples correspond to the same 3D line segment. If L2 is projected to View 1 (as B′D′ in Fig. 4.3(b)), where A and C are the end points of the line sample obtained in View 1, B and D can be calculated as the intersections of lines OB′ and OD′ with the line containing A and C, respectively, with O being the (projected) camera center of View 2, which is found in advance.
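A short sketch of this transfer of L2 into View 1 through the ground plane, assuming H1_pi and H2_pi are 3x3 numpy matrices mapping View 1 and View 2 onto π, respectively (the function name is ours):

```python
import numpy as np

def transfer_to_view1(p_view2, H1_pi, H2_pi):
    """Map an image point of View 2 into View 1 via the ground plane:
    H2_pi takes the point onto plane pi, and the inverse of H1_pi brings
    it back into View 1."""
    x = np.array([p_view2[0], p_view2[1], 1.0])   # homogeneous coordinates
    y = np.linalg.inv(H1_pi) @ (H2_pi @ x)        # composite homography
    return y[:2] / y[2]                           # inhomogeneous image point

# Transferring the two end points of L2 this way yields B' and D' of Fig. 4.3(b).
```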


Fig. 4.3. (a) Illustration of the basic idea of the proposed correspondence measure of two line features (samples). (b) Illustration of a general form of the view-invariant cross ratio.

Instead of using the above lengths, whose values will vary with viewpoints, the view-invariant cross ratio, in one of several forms as discussed in [29], can be used to evaluate the degree of line correspondence as

$CR = \frac{\triangle OAB \cdot \triangle OCD}{\triangle OBC \cdot \triangle OAD}$  (3.1)

wherein each of the four terms represents a signed triangular area in Fig. 4.3(b). If L1 and L2 correspond to a perfect match, points A and B (and points C and D) will coincide, and CR = 0. Moreover, since $\triangle OAB/\triangle OAB' = \triangle OBC/\triangle OB'C = OB/OB'$ and $\triangle OCD/\triangle OCD' = \triangle OAD/\triangle OAD' = OD/OD'$, we have

$\frac{\triangle OAB \cdot \triangle OCD}{\triangle OBC \cdot \triangle OAD} = \frac{\triangle OAB' \cdot \triangle OCD'}{\triangle OB'C \cdot \triangle OAD'}$  (3.2)

and (3.1) can be calculated more efficiently by

$CR = \frac{\triangle OAB' \cdot \triangle OCD'}{\triangle OB'C \cdot \triangle OAD'}$  (3.3)

since there is no need to compute B (D) from B′ (D′). Thus, the proposed view-invariant measure of line correspondence, with a zero value representing a perfect match¹¹, can actually be evaluated in either one of the two views by first computing the homographic transform, e.g., H1π⁻¹H2π for View 1 in Fig. 4.3(a), of the two end points of a candidate line segment in the other view.
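Since each term in (3.3) is a signed triangular area, the measure can be evaluated directly from the five image points O, A, C, B′, and D′ of Fig. 4.3(b). A minimal sketch (function names are ours):

```python
def signed_area(o, p, q):
    """Signed area of triangle (o, p, q); the sign encodes orientation."""
    return 0.5 * ((p[0] - o[0]) * (q[1] - o[1]) -
                  (q[0] - o[0]) * (p[1] - o[1]))

def line_correspondence_cr(O, A, C, B_prime, D_prime):
    """Evaluate (3.3): CR = (OAB' * OCD') / (OB'C * OAD');
    CR = 0 corresponds to a perfect match of the two line samples."""
    den = signed_area(O, B_prime, C) * signed_area(O, A, D_prime)
    if den == 0.0:
        return float('inf')  # degenerate; handled by the screening in 4.3.2
    return (signed_area(O, A, B_prime) * signed_area(O, C, D_prime)) / den
```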


4.3.2 Applying the line correspondence measure to improve the efficiency of people localization

In this subsection, we apply the proposed line correspondence measure to improve the efficiency of the people localization method of Section 4.1¹². Instead of finding correspondences of realistic line features in the scene, we verify whether 2D line samples from different views belong to the same person. Thus, computations associated with a 3D line sample that clearly results from two line samples of different persons can be avoided. Such computations include (i) 3D reconstruction of the 3D line sample, as mentioned in Section 4.1, (ii) 3D validations, and (iii) 2D (foreground) consistency checks of the 3D line sample. For example, for (ii), physical properties of a human body can be used to validate the heights of B and C, and the length of BC in Fig. 4.3(a). As for (iii), if a person does exist in the scene, the image of the person should be covered by some foreground regions in all views, so points on each 3D line sample are back projected to all views for further verification. While the complexity of (ii) is very low once (i) is done, (iii) is very expensive since each back projection requires the computation of a homographic transformation.

Fig. 4.4 shows the procedure for determining whether two line samples obtained from two different views are likely to represent the same person using various parts of (3.3). First, if the denominator of (3.3) is not greater than zero, i.e.,

$\triangle OB'C \cdot \triangle OAD' \le 0$,  (3.4)

the reconstruction from the two line samples will have zero length. Thus, we can conclude that the samples belong to different persons. Except for the special cases, which seldom occur in practice, where one or both ends of the two 2D line samples are reconstructed coincidentally so that the numerator of (3.3) equals zero¹³, (3.3) can be evaluated numerically to determine whether the reconstructed 3D line sample may result from the same person, in which case further refinements and verifications, e.g., (ii) and (iii), are needed.

Fig. 4.5 shows two numeric examples of the proposed line correspondence measure for some line samples shown in Figs. 2.6(b) and (d). While a small value (0.0034) is obtained for Fig. 4.5(a) where two line samples correspond to the same person, a larger value (0.0096) is obtained for Fig. 4.5(b) because of occlusion. A threshold of 0.01 is used in the experiments considered next to determine whether |CR| is small enough.
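A sketch of this decision flow, reusing `signed_area` from the previous sketch together with the 0.01 threshold mentioned above:

```python
def pass_early_screening(O, A, C, B_prime, D_prime, cr_threshold=0.01):
    """Decision procedure of Fig. 4.4 (a sketch): reject a pair of line
    samples whose reconstruction is degenerate (3.4) or whose |CR| is too large."""
    den = signed_area(O, B_prime, C) * signed_area(O, A, D_prime)
    if den <= 0:          # (3.4): the samples belong to different persons
        return False
    cr = signed_area(O, A, B_prime) * signed_area(O, C, D_prime) / den
    return abs(cr) < cr_threshold  # small |CR|: keep for refinement/verification
```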

¹² This is also true for the approach described in Chapter 2.

¹³ It is easy to see that in either case, which hardly occurs in practice, additional views are still needed to refine and verify the reconstructed 3D line sample.


Fig. 4.4. Procedure to determine whether two line samples are likely to represent the same person.

Fig. 4.5. Two numeric examples of the proposed line correspondence measure for line samples shown in Figs. 2.6(b) and (d): (a) two line samples of the same person (|CR| = 0.0034); (b) two line samples affected by occlusion (|CR| = 0.0096).
