1.1 Motivation
The recent proliferation of security surveillance cameras necessitates the development of automatic or semi-automatic surveillance systems assisted by computer technology. Research on vision-based people localization has therefore been gaining popularity, and in recent years there has been a tremendous wave of interest in people localization for crowded scenes. In a real-world environment, serious occlusions occur frequently within a group of people, and current research still leaves room for improving both the accuracy and the efficiency of occlusion handling.
Conventional people localization approaches are based on single-camera monitoring. A target can be detected successfully with a single static or moving camera if it is neither occluded by nor occluding others in the scene. However, such monocular approaches may not achieve high accuracy under serious occlusion. An example is shown in Figure 1.1 and Figure 1.2; the two binary foreground images are obtained from the original images by background subtraction. Although the two figures are only 30 frames apart, the circled foreground region in Figure 1.1(b) can be clearly recognized as an isolated person, whereas the region in Figure 1.2(b) is hard to recognize as two people because of the serious occlusion.
Figure 1.1 An example of isolated people in frame 185. (a) The frame before occlusion occurs. (b) The binary foreground image of (a).
Figure 1.2 An example of serious occlusion in frame 215. (a) The frame in which the person wearing the red jacket is occluded. (b) The binary foreground image of (a).
To overcome these limitations, vision-based localization and tracking have shifted from monocular to multi-camera approaches, since the latter can handle serious occlusion better by using more information. An example of multi-camera localization is shown in Figure 1.3, with 9 persons in four views of the same scene. The four views contain 3, 4, 2, and 3 foreground regions with serious occlusion, respectively. By using multiple views of the same scene, however, the localization recovers information that might be missing in a particular view and achieves good results under serious occlusion, as shown in Figure 1.3(b).
Figure 1.3 Multi-camera approach provides sufficient information for people localization. (a) The binary foreground images from four views of the same scene. (b) The localization result obtained from (a) by using our method.
However, a multi-camera approach must process the information from the additional views, which leads to much higher computational complexity. Our purpose is to propose an efficient and effective approach for people localization using multiple cameras, which can handle serious occlusion in a crowded scene and provide real-time performance without special hardware.
1.2 Review of Related Works
In the last decade, a considerable number of approaches to people localization and tracking have been devoted to dealing effectively with the occlusion problem. Traditional monocular (single-camera) approaches [1]–[3] often cannot achieve high accuracy because of the limited viewpoint and the cluttering issue, i.e., a person in one view might be partially or completely occluded by other people. To overcome these limitations, many recent people localization schemes adopt multiple cameras [4]–[9].
Hu et al. [4] propose a method based on principal axes, wherein each person is represented by an axis, to estimate the feet points in images. Before the principal axes of people can be determined, each foreground region must be classified as an isolated person, a group of people, or occluded people. Since the principal axis-based method relies heavily on the accuracy of this classification step, it may not work well for dense crowds.
Instead of using shape cues or color models to analyze foreground regions as in [4], Khan et al. [5] propose a people tracking method that neither detects nor tracks objects in any single camera or camera pair. The method projects the foreground likelihood information of all image pixels, captured from different views, onto multiple reference planes at different heights and integrates it to form an occupancy probability. Unlike the method in [5], which performs the reconstruction in three dimensions, the methods proposed by Fleuret et al. [6] and Alahi et al. [7] use only an occupancy map on a grid of the ground plane, computed by back-projecting a predefined model, e.g., a rectangle, onto the image planes. The approaches presented in [5]–[7] require no correspondences of people between different views and perform quite well under serious occlusion, but their pixel-based processing results in high computational complexity. Such methods are therefore not suitable for certain surveillance applications, such as intruder detection and abnormal behavior detection, in which people localization is only part of the complete process and prompt attention demands very high processing and response speed.
In [9], Lo and Chuang propose an efficient vanishing point-based line sampling technique for people localization that achieves near real-time performance by avoiding the projection of all foreground pixels of multiple camera views onto all reference planes. The computational complexity is thus reduced from pixel-based to line-based processing. Multi-plane homography is used to obtain pairwise intersections of the line samples at different heights, from which the vertical line samples in the 3D scene are reconstructed for people location estimation.
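The idea of sampling a foreground mask along lines through a vanishing point can be illustrated with a minimal Python sketch. It is not the sampling procedure of [9]: the function name `sample_lines`, the parameters `num_lines` and `min_len`, and the assumption that the vertical vanishing point lies below the image are all introduced here for illustration only.

```python
import numpy as np

def sample_lines(mask, vanishing_point, num_lines=90, min_len=5):
    """Sample a binary foreground mask along rays cast from a vanishing point.

    Hypothetical helper. Assumes the vanishing point lies below the image,
    so the end of each run nearest to it is treated as the foot end.
    Returns a list of ((foot_x, foot_y), (head_x, head_y)) segments.
    """
    h, w = mask.shape
    vx, vy = vanishing_point
    corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]], dtype=float)
    # The fan of rays that can cross the image is bounded by the corner angles.
    angles = np.arctan2(corners[:, 1] - vy, corners[:, 0] - vx)
    reach = np.hypot(corners[:, 0] - vx, corners[:, 1] - vy).max()
    t = np.arange(0.0, reach, 1.0)  # one sample per pixel along each ray
    segments = []
    for a in np.linspace(angles.min(), angles.max(), num_lines):
        xs = np.rint(vx + t * np.cos(a)).astype(int)
        ys = np.rint(vy + t * np.sin(a)).astype(int)
        inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        fg = np.zeros(t.shape, dtype=bool)
        fg[inside] = mask[ys[inside], xs[inside]] > 0
        # Locate contiguous foreground runs along the ray.
        padded = np.concatenate(([False], fg, [False]))
        edges = np.flatnonzero(padded[1:] != padded[:-1])
        for start, stop in zip(edges[0::2], edges[1::2]):
            if stop - start >= min_len:  # discard very short runs as noise
                foot = (int(xs[start]), int(ys[start]))
                head = (int(xs[stop - 1]), int(ys[stop - 1]))
                segments.append((foot, head))
    return segments
```

Each view is reduced to a short list of line segments rather than a full set of foreground pixels, which is the source of the pixel-based to line-based reduction described above.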
In this study, we continue to use the vanishing point-based line sampling technique of [9] and further improve the efficiency of this line sample-based approach. Instead of using multi-plane projection for reconstruction in three dimensions, we consider only one reference (ground) plane to analyze the footsteps of people, which significantly improves computational efficiency. Experimental results show that accuracy comparable to that of our previous approach [9] can be achieved at ten times the computing speed.
1.3 Overview of Proposed Methods
In this study, we propose an efficient and effective approach for people localization using multiple cameras. Figure 1.4 illustrates the schematic diagram of the proposed framework. First, camera calibration is performed as a preprocessing step to find the vanishing point of vertical lines in the scene for each image plane. Next, as in [9], we generate lines originating from this vanishing point to sample the foreground objects (people) in each camera view. The line samples of foreground objects from all camera views are then projected onto the ground plane via homographic transformation, and regions crossed by a large number of projected sample lines are identified as candidate people locations.
We then generate (vertical) 3D line samples for these candidate people locations. After a refinement/verification procedure for these 3D line samples, the height of each person can also be estimated. Finally, the remaining 3D line samples are clustered into individual axes to indicate people locations and heights.
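The ground-plane projection and voting step described above can be sketched as follows. This is a simplified illustration, not the two-layer grid occupancy map or footstep analysis developed in later chapters. The function names `to_ground` and `vote`, the grid parameters, and the assumption that an image-to-ground homography is available for each view from calibration are all hypothetical.

```python
import numpy as np

def to_ground(H, pts):
    """Map image points (N x 2) to ground-plane coordinates with a 3x3 homography H."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]  # divide by the homogeneous coordinate

def vote(segments_per_view, homographies, grid_shape, cell, samples=50):
    """Count, per ground grid cell, how many projected line samples cross it.

    segments_per_view: one list of ((x0, y0), (x1, y1)) image segments per view.
    homographies: one image-to-ground 3x3 matrix per view (assumed known).
    grid_shape: (rows, cols) of the ground grid; cell: cell size in ground units.
    """
    votes = np.zeros(grid_shape, dtype=int)
    s = np.linspace(0.0, 1.0, samples)[:, None]
    for H, segments in zip(homographies, segments_per_view):
        for p0, p1 in segments:
            # A homography maps lines to lines, so projecting the two
            # endpoints and interpolating between them is sufficient.
            a, b = to_ground(H, [p0, p1])
            cells = np.floor((a + s * (b - a)) / cell).astype(int)
            ok = ((cells[:, 0] >= 0) & (cells[:, 0] < grid_shape[1]) &
                  (cells[:, 1] >= 0) & (cells[:, 1] < grid_shape[0]))
            if not ok.any():
                continue
            cells = np.unique(cells[ok], axis=0)  # each line votes once per cell
            votes[cells[:, 1], cells[:, 0]] += 1
    return votes
```

Candidate locations could then be taken as the cells whose vote count reaches a chosen threshold, for example `np.argwhere(votes >= min_votes)`, where `min_votes` is a tuning parameter assumed here.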
Figure 1.4 Schematic diagram of the proposed people localization framework
1.4 Contributions of This Thesis
In this study, we propose an efficient and effective method capable of locating people in a dense crowd in real time using multiple cameras. We retain the advantage of the vanishing point-based line sampling proposed in [9]: foreground features such as color models or shape cues are not needed. Furthermore, we develop a 3D line sampling scheme that uses a single reference (ground) plane to estimate people locations, instead of performing reconstruction by computing pairwise intersections of the sample lines at different heights as in [9]. The proposed method runs at up to 180 frames per second. For applications such as intruder detection and abnormal behavior detection, in which people localization is only part of the complete process, our approach can help provide prompt attention through very high processing and response speed. Experiments show that satisfactory recall and precision rates can be achieved by the proposed method under serious occlusion for some crowded real-world scenes.
1.5 Thesis Organization
The remainder of this thesis is organized as follows. In Chapter 2, we explain how to generate 2D line samples in multiple views based on vanishing points. In Chapter 3, a two-layer grid occupancy map is generated by projecting these 2D line samples onto the ground plane for footstep analysis, which estimates candidate people locations. In Chapter 4, 3D line samples are generated from these candidate people locations, and a refinement/verification scheme is developed to validate each 3D line sample. Experimental results showing reasonable accuracy and efficiency in people localization are given in Chapter 5. Finally, conclusions of our study and suggestions for future work are given in Chapter 6.