Chapter 1 Introduction
1.3 Organization of thesis
In this thesis, we propose a novel image transform, based on the vanishing point of vertical lines (VPVL) in the scene, to improve face detection for surveillance cameras mounted in the common elevated position. The flowchart of the proposed face detection approach is shown in Figure 1.5. For each video frame, we first correct the image distortion with one of the proposed image transformations and then apply face detectors to obtain face candidates in the transformed image. To further reduce false alarms, we also propose to use skin color analysis to remove some of the candidates reported by the face detectors.
The rest of this thesis is organized as follows. Chapter 2 details the image transform approach. In Chapter 3, we use the percentage of skin color in the detection window to reduce false alarms. Experimental results are presented in Chapter 4, which show that the proposed methods can significantly enhance detection performance. Finally, we give conclusions and discuss future work in Chapter 5.
Figure 1.5: Flowchart of the proposed face detection approach.
[Flowchart nodes: Video Sequence; Image Rectification; Face Detected? (Yes/No); Face Candidates; Enough Percentage of Skin Color? (Yes/No); Detected Faces; Skip Detection.]
Chapter 2
Vanishing Point-based Image Rectification
In this chapter, we present three different yet related transformations to rectify images with the perspective distortion that arises in common surveillance applications. In Section 2.1, an intuitive transformation is proposed as a baseline approach; its rectification results for Figure 1.2(c) and (d) are shown in Figure 2.1(a) and (b). In Section 2.2, a second transformation significantly improves on the first by recovering the aspect ratio of some objects, as shown in Figure 2.1(c) and (d). Finally, Section 2.3 presents a modification of the transformation in Section 2.2 that performs better in practical applications, according to our observations.
Figure 2.1: Rectification results of Figures 1.2(c) and (d). (a)-(b) Obtained from transformation 1 discussed in Section 2.1. (c)-(d) Obtained from transformation 2 discussed in Section 2.2.
To convey the basic idea of the proposed method, we first describe how an object is captured by a surveillance camera. As shown in Figure 2.2, assume that the camera is located higher than the object and that its optical axis is not horizontal. Suppose a square billboard stands vertically in front of the camera, with the pattern shown in Figure 1.2(b). The image of the billboard will then exhibit perspective distortion, as shown in Figure 1.2(d). Furthermore, the extensions of the vertical lines on the billboard will meet at a single point in the image plane, namely the vanishing point of vertical lines (VPVL).
For simplicity, in the following discussion we assume that the image center is the origin of the image plane, and that the VPVL, $V = (V_x, V_y)$, can be derived from images of vertical lines via simple camera calibration. Additionally, each image is assumed to be rotated in advance such that $V_x = 0$.
Figure 2.2: The relation between the camera, a standing billboard, and the ground plane, with the camera represented by its lens for simplicity. The billboard is assumed to be facing the vertical line containing the camera center.
2.1 Transformation 1
For the surveillance system described in Figure 2.2, all square patterns on the billboard, B, are transformed into trapezoids in the image plane, P, as shown in Figure 1.2(d). An intuitive way to rectify the image is to make all lines pointing to the VPVL vertical and therefore parallel. To this end, we propose our first transformation and explain it as follows. Consider Figure 2.3, where we would like to generate the image in Figure 2.3(b) from the original image in Figure 2.3(a). In particular, we would like to transform two horizontal line segments in Figure 2.3(a) into two line segments in Figure 2.3(b) that are still horizontal and have the same length, where O and o are the image centers of Figure 2.3(a) and (b), respectively. According to triangle geometry and simple calculation, we have
(2.1)
(2.2)
In general we can define a transformation such that for each pixel (x, y) in Figure 2.3(b), its content can be obtained at position (X, Y) in Figure 2.3(a) such that
(2.3)
In Figure 2.1(b) we can see that each trapezoid is recovered as a rectangle, but not as a square. Moreover, a white block near the top is much larger than one near the bottom, resulting in rectangles of different heights. Finally, Figure 2.4 shows an example of image rectification obtained with transformation 1.
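The idea behind transformation 1 can be sketched in code. The snippet below is only an illustration under stated assumptions, not a reproduction of equations (2.1)-(2.3): it assumes that pixel coordinates are relative to the image center, that the VPVL lies on the vertical axis at (0, v_y), and that every row keeps its y-coordinate. The function name `transform1_source` and the parameter `v_y` are hypothetical.

```python
def transform1_source(x, y, v_y):
    """Return the source pixel (X, Y) in the original image for the
    rectified pixel (x, y).

    Assumptions (not taken from the thesis): coordinates are relative to
    the image center, the VPVL is at (0, v_y), and each row keeps its
    y-coordinate. Every line through the VPVL then becomes vertical.
    """
    if v_y == 0:
        raise ValueError("the VPVL must not coincide with the image center")
    if y == v_y:
        raise ValueError("the row through the VPVL collapses to one point")
    scale = (v_y - y) / v_y  # 1 on the center row, 0 on the VPVL row
    return x * scale, y


# Rectified pixels in one column (x = 100) come from one line through the VPVL.
v_y = 1000.0
for y in (-200.0, 0.0, 200.0):
    X, Y = transform1_source(100.0, y, v_y)
    # The ratio X / (v_y - Y) is the same for all three source pixels,
    # so they lie on a single line through the VPVL.
    print(X, Y, X / (v_y - Y))
```

Under these assumptions the center row keeps its length while other rows are stretched or shrunk horizontally, which is consistent with the observation that trapezoids become rectangles but not squares.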
Figure 2.3: An illustration of transformation 1. (a) Image before rectification. (b) Image after rectification.
Figure 2.4: (a) Original image. (b) Rectified result of transformation 1.
2.2 Transformation 2
In order to further recover the aspect ratio of an object, a novel image transformation is presented in this section. To simplify the presentation, we first rotate the camera model shown in Figure 2.2 to make the optical axis horizontal, as illustrated in Figure 2.5, where θ is the tilt angle, f is the focal length of the camera, l is a vertical line connecting the optical center (O) and the ground, and the remaining parameter is the distance from the optical center to the point at which the optical axis intersects the billboard (B). Furthermore, we set the origin of the 3D coordinate system at the optical center, with the z-axis pointing to the right, the y-axis pointing downward, and the x-axis pointing out of the paper.
Figure 2.5: A rotation of Figure 2.2, to make the optical axis of the camera horizontal.
As shown in Figure 2.6, the value of the unknown parameter f can be found as
$$f = \frac{F\sqrt{W^2 + H^2}}{L} \qquad (2.4)$$
where F is the focal length in standard metric units, L is the (diagonal) size of the image sensor in the same units, and W and H are the image width and height in pixels.
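The conversion of the focal length from metric units to pixels can be sketched as follows. This is a minimal illustration that assumes square pixels and a sensor size given as its diagonal; the function name and the numbers in the usage lines are hypothetical and are not the specifications of the cameras used in this thesis.

```python
import math


def focal_length_pixels(focal_mm, sensor_diag_mm, width_px, height_px):
    """Convert a focal length in millimetres to pixels.

    Assumes square pixels and that the sensor size is its diagonal,
    expressed in the same unit as the focal length.
    """
    diag_px = math.hypot(width_px, height_px)  # image diagonal in pixels
    return focal_mm * diag_px / sensor_diag_mm


# Hypothetical numbers for illustration only.
f = focal_length_pixels(4.0, 6.0, 640, 480)
print(f)
```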
Figure 2.6: An illustration showing the relationship between the image sensor and the image.
To investigate the relation between the billboard (B) and the image plane (P), let
(2.5)
where represents the 3D coordinates and represents 2D coordinates of a pixel on B.
Moreover, let denote the 3D coordinates of a pixel on the image plane (P), we have
(2.6)
Because P is parallel to the XY-plane of the 3D coordinate, is also the 2D coordinates of this corresponding pixel on P. Thus we can define an image transformation from P to a recovered image R as:
where with being the coordinates of the transformed pixel. Finally, an example of image rectification of transformation 2 is shown in Figure 2.7.
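In this transformation, the distance from the optical center to the billboard affects only the scale of the transformed image and can be assigned an arbitrary value according to the user's requirements. On the other hand, for the remaining unknown θ in equation (2.7), we can derive $\sin\theta$ and $\cos\theta$ as:
$$\sin\theta = \frac{f}{\sqrt{f^2 + V_y^2}}, \qquad \cos\theta = \frac{V_y}{\sqrt{f^2 + V_y^2}}$$
where $V_y$ is the y-coordinate of the VPVL.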
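The geometry of transformation 2 can be illustrated with a small pinhole-camera sketch. It is written under explicit assumptions rather than taken from equations (2.5)-(2.7): θ is measured between the optical axis and the horizontal, so that the VPVL lies at (0, f / tan θ); the billboard is centered on the optical axis at a distance `d`; and the rectified pixel (x, y) is read as 2D coordinates on the billboard. All function and parameter names are hypothetical.

```python
import math


def tilt_from_vp(f, v_y):
    """Return (sin θ, cos θ) from the focal length f and the VPVL
    y-coordinate v_y (both in pixels), assuming v_y = f / tan θ."""
    norm = math.hypot(f, v_y)
    return f / norm, v_y / norm


def transform2_source(x, y, f, d, sin_t, cos_t):
    """Source pixel (X, Y) on the image plane P for the rectified pixel
    (x, y), taken as 2D coordinates on the billboard B.

    Assumed model: the point (x, y) on B has 3D coordinates
    (x, y cos θ, d + y sin θ) and is projected by a pinhole camera
    with focal length f.
    """
    z = d + y * sin_t  # depth of the billboard point along the optical axis
    if z <= 0:
        raise ValueError("point is not in front of the camera")
    return f * x / z, f * y * cos_t / z


# A vertical line in the rectified image (x = 100) comes from a line
# through the VPVL (0, v_y) in the original image.
f_px, v_y, d = 800.0, 800.0, 1000.0
sin_t, cos_t = tilt_from_vp(f_px, v_y)
for y in (0.0, 250.0, 500.0):
    X, Y = transform2_source(100.0, y, f_px, d, sin_t, cos_t)
    print(X, Y, X / (v_y - Y))  # the last value is constant
```

In this model, multiplying `d`, `x` and `y` by the same factor leaves the source pixel unchanged, which matches the remark that the distance affects only the scale of the transformed image.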
Figure 2.7: (a) Original image. (b) Rectified result of transformation 2.
2.3 Transformation 3
One characteristic of transformation 2 is that it is only valid for billboards parallel to the XY-plane of the 3D coordinate system. However, in most applications we are interested in all people facing l in Figure 2.2. For example, if a camera is installed above an entrance, we are more interested in people walking toward the entrance. In addition, people facing l provide more frontal face information and are thus more valuable. To make the proposed method suitable for any billboard facing l, we further propose a modified version of transformation 2, namely transformation 3, such that after the transformation all pixels on the image plane (P) at the same distance from the VPVL are mapped to the same horizontal line in the recovered image. The difference between the results of transformations 2 and 3 is illustrated in Figure 2.8, where an image containing a person facing l, but not the XY-plane, is rectified by transformations 2 and 3, respectively. For better illustration, we draw a green line through the person's eyes. The green line in the result of transformation 2 (Figure 2.8(a)) is clearly not horizontal, whereas in Figure 2.8(b) it is nearly horizontal, so the face in Figure 2.8(b) is closer to a normal face.
Figure 2.8: Rectified images of Figure 1.1(b) obtained with transformations 2 and 3. (a) Result of transformation 2. (b) Result of transformation 3.
Figure 2.9 explains how the corresponding pixels in the original image can be found according to the above description of transformation 3. Several corresponding lines and pixels in the original image (Figure 2.9(a)) and in the result of transformation 3 (Figure 2.9(b)) are illustrated. We now show how to find, for a pixel in Figure 2.9(b), its corresponding pixel in Figure 2.9(a). Given the pixel, we first find the pixel that lies on the central vertical line and has the same y-coordinate. According to transformation 2, this central pixel comes from a pixel $(X_c, Y_c)$ in the original image. We then rotate $(X_c, Y_c)$ around the VP, $(V_x, V_y)$, by an angle $\varphi$ that is proportional to the distance between the given pixel and the central pixel, which gives the corresponding pixel $(X, Y)$ as:
$$\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{bmatrix} \begin{bmatrix} X_c - V_x \\ Y_c - V_y \end{bmatrix} + \begin{bmatrix} V_x \\ V_y \end{bmatrix} \qquad (2.9)$$
Figure 2.9: Relation between original image and rectified image by transformation 3. (a) Original image. (b) Rectified image by transformation 3.
Notice that the x-coordinates of the VP and of the central pixel are both 0 and hence can be ignored. Combining equation (2.9) with equation (2.7) of transformation 2, we derive the formulation of transformation 3:
( ) (
) ( ) (
)
where represents a pixel on the rectified image and represents a pixel on plane P. An example of the transformation result is shown in Figure 2.10, where the red curve in Figure 2.10(a), all of whose pixels are equidistant from the VPVL, is mapped to a straight line in Figure 2.10(b).
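The two steps of transformation 3 described above (map the central pixel, then rotate around the VPVL) can be sketched as follows. This is an illustration under assumptions, not the thesis's final formula: the central pixel is mapped with a pinhole model in which the VPVL is at (0, v_y) with v_y = f / tan θ, and the rotation angle is `k * x`, where the constant `k` (radians per pixel) is a hypothetical parameter, as are the function name and `d`.

```python
import math


def transform3_source(x, y, f, d, v_y, k):
    """Source pixel (X, Y) in the original image for the rectified pixel
    (x, y).

    Sketch only: the central pixel (0, y) is mapped with an assumed
    pinhole model, then rotated around the VPVL (0, v_y) by the angle
    k * x. The constant k is hypothetical.
    """
    norm = math.hypot(f, v_y)
    sin_t, cos_t = f / norm, v_y / norm
    y_c = f * y * cos_t / (d + y * sin_t)  # source of the central pixel (0, y)
    phi = k * x                            # rotation angle, proportional to x
    r = y_c - v_y                          # offset of that source from the VPVL
    # Rotate the offset (0, r) around the VPVL by phi.
    return -r * math.sin(phi), v_y + r * math.cos(phi)


# All pixels of one rectified row come from points at the same distance
# from the VPVL in the original image.
for x in (-200.0, 0.0, 200.0):
    X, Y = transform3_source(x, 50.0, 800.0, 1000.0, 800.0, 0.001)
    print(X, Y, math.hypot(X, Y - 800.0))
```

The printed distances are identical, which is the property that distinguishes transformation 3 from transformation 2 in the description above.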
Figure 2.10: (a) Original image. (b) Rectified result of transformation 3.
Chapter 3
Method for Reducing False Alarms
In this chapter, we aim to reduce false alarms for better detection performance. We propose to use skin detection because of its efficiency. Skin detection is carried out only on the regions of candidates found by the face detectors, to further reduce the computational demand. In Section 3.1, the skin detection method adopted in our system is introduced. The complete process and the final results are presented in Section 3.2.
3.1 Review of Skin Detection
In the work of Garcia et al. [16], a skin color sub-space in the YCbCr space is defined for the detection of skin regions in MPEG streams and JPEG images. A data set containing 950 skin color samples is used to approximate the color sub-space. These samples are extracted from various still images and video frames, covering a large range of skin color appearances caused by different races and lighting conditions.
Unlike prior work, Garcia et al. propose to take the intensity (Y) value into account to deal with strong lighting variations, as they notice that the skin color distribution changes when Y takes extreme values. More specifically, to approximate the distribution borders in the extremely light and dark cases, the authors propose two groups of plane equations for two areas of the color space, separated by the horizontal plane Y = 128. Their proposed plane equations are presented in equation (3.1), and the two skin color sub-spaces defined by equation (3.1) for two different Y values are shown in Figure 3.1.
Figure 3.1: Skin color region defined by [16] at Y = 110 and Y = 140. (a) YCbCr plane at different Y values. (b) Skin region defined by equation (3.1). (c) Skin colors at different Y values.
3.2 Skin Color Percentage in Candidate Regions
Figure 3.2: (a) A picture with five candidate regions. (b) 0% skin color percentage in region 1. (c) 33% skin color percentage in region 2. (d) 0% skin color percentage in region 3. (e) 79% skin color percentage in region 4. (f) 0% skin color percentage in region 5.
In this section, we demonstrate how to reduce false alarms with skin detection and obtain better detection performance. For each image frame, we first find several candidates with the face detector. We then apply the skin detection method described in Section 3.1 to find skin pixels within the regions of these candidates. Finally, a candidate is discarded if the percentage of skin pixels in its region is less than a threshold. Take Figure 3.2 as an example: in Figure 3.2(a), five candidates are detected and marked with red squares. For each candidate, the skin pixels are detected and their percentage is recorded in the caption of Figure 3.2(b)-(f). We can therefore discard bad candidates if we can find a suitable threshold for the skin pixel percentage. To determine a threshold that properly separates true faces from false alarms, two videos are tested, namely S10 for the indoor environment and O5 for the outdoor environment, as shown in Figure 3.3. Table 3.1 and Table 3.2 show the face detection results for different threshold values on the two videos. One can see that the best detection results are achieved when the threshold is set between 10% and 40% for the indoor environment (between 15% and 35% for the outdoor environment). With such thresholds, on these two videos we keep all true positives of the original detection while removing all false positives.
We then set the threshold to 40% and compare, in Figure 3.4, the face detection results on another video with and without the false alarm reduction. One can see that, with false alarm reduction, most false positives are removed while the true faces are preserved.
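The filtering step described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the skin mask is assumed to be produced beforehand by some skin classifier (for example the one reviewed in Section 3.1), candidate boxes are assumed to lie fully inside the image, and the function names and the box format `(left, top, width, height)` are hypothetical.

```python
def skin_fraction(skin_mask, box):
    """Fraction of skin pixels inside box = (left, top, width, height).

    skin_mask is a list of rows of booleans (True = skin pixel), assumed
    to come from a separate skin classifier. The box is assumed to lie
    entirely inside the mask.
    """
    left, top, width, height = box
    rows = skin_mask[top:top + height]
    skin = sum(sum(1 for v in row[left:left + width] if v) for row in rows)
    return skin / float(width * height)


def filter_candidates(skin_mask, boxes, threshold=0.40):
    """Keep only the candidates whose skin fraction exceeds the threshold."""
    return [b for b in boxes if skin_fraction(skin_mask, b) > threshold]


# Toy 4x4 mask in which only the left half is skin.
mask = [[True, True, False, False] for _ in range(4)]
candidates = [(0, 0, 2, 2), (2, 0, 2, 2), (1, 0, 2, 2)]
print(filter_candidates(mask, candidates))
```

The default of 0.40 mirrors the 40% threshold chosen above; the strict comparison follows the "P>40%" notation of Tables 3.1 and 3.2.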
Figure 3.3: Videos used to find the threshold of skin pixel percentage. (a) Video S10. (b) Video O5.
Table 3.1: The detection results for S10 with different thresholds of skin pixel percentage.
Threshold   TP   FP   FN
Original    80   31    0
P>0%        80    9    0
P>5%        80    2    0
P>10%       80    0    0
P>15%       80    0    0
P>20%       80    0    0
P>25%       80    0    0
P>30%       80    0    0
P>35%       80    0    0
P>40%       80    0    0
P>45%       78    0    2
P>50%       63    0   17
P>55%       41    0   39
P>60%       19    0   61
P>65%        0    0   80
P>70%        0    0   80
P>75%        0    0   80
P>80%        0    0   80
P>85%        0    0   80
P>90%        0    0   80
P>95%        0    0   80
P=100%       0    0   80
Table 3.2: The detection results for O5 with different thresholds of skin pixel percentage.
Threshold    TP    FP    FN
Original    115   184     0
P>0%        115    80     0
P>5%        115    56     0
P>10%       115     6     0
P>15%       115     0     0
P>20%       115     0     0
P>25%       115     0     0
P>30%       115     0     0
P>35%       115     0     0
P>40%       112     0     3
P>45%       107     0     8
P>50%       102     0    13
P>55%        96     0    19
P>60%        85     0    30
P>65%        78     0    37
P>70%        70     0    45
P>75%        39     0    76
P>80%        14     0   101
P>85%         6     0   109
P>90%         0     0   115
P>95%         0     0   115
P=100%        0     0   115
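The TP, FP and FN counts above can be summarized as precision and recall, the two measures used for the experiments in Chapter 4. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), on three rows copied from Table 3.2; the function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio whose denominator is zero is
    reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# (TP, FP, FN) rows copied from Table 3.2 (video O5).
rows = {"origin": (115, 184, 0), "P>15%": (115, 0, 0), "P>40%": (112, 0, 3)}
for name, counts in rows.items():
    p, r = precision_recall(*counts)
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```

For video O5, a threshold of 15% raises precision without changing recall, whereas a threshold of 40% trades a small loss in recall for the same precision.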
Figure 3.4: Reducing false alarms by requiring 40% skin color in candidate regions, for images of different paths.
Chapter 4
Experimental Results
In this chapter, we present four experiments to show how face detection can be improved with the proposed method. We have tested our method on videos captured in three real environments as well as on some synthesized images. In Section 4.1, the specifications of the cameras used in our experiments and their installation parameters are detailed. In Section 4.2, we show some image samples from our test videos. An experiment for comparison is presented in Section 4.3; in this experiment, the videos are simply rotated such that the new position of the VPVL is at the horizontal center. The face detection results of our method on videos from real environments, with and without the false alarm reduction method, are reported in Section 4.4 and Section 4.5, respectively. Finally, in Section 4.6 we show the experiments on synthesized images with multiple camera installation settings.
4.1 Environment Settings
Throughout our experiments, two cameras are used, referred to by their model names as AXIS 207MW and VIVOTEK FD8161. In the indoor experiments, the AXIS 207MW is mounted on the ceiling without any calibration, about 2.7 meters above the ground plane, as shown in Figure 4.1. In the outdoor experiments, we install the VIVOTEK FD8161 above an entrance, also without any calibration, at a height of about 2.5 meters, as shown in Figure 4.2.
Figure 4.1: Camera AXIS 207MW settings. (a) Experimental environment. (b) Close-up view of camera.
Figure 4.2: Camera VIVOTEK FD8161 settings. (a) Experimental environment. (b) Close-up view of camera.
4.2 Video Demonstration
We have captured 25 videos for use in our experiments. As mentioned previously, they cover both an indoor and an outdoor environment. Because the two environments use different camera models, the video resolutions differ: 640x480 for the indoor videos and 800x600 for the outdoor videos. In all videos, one or more people are captured walking roughly toward the camera. Furthermore, these people start walking from different positions, so that faces with different sizes, rotations, and distortions appear in these videos.
Figure 4.3: Frame examples of a single person walking in the laboratory. (a) Video S1. (b) Video S2. (c) Video S3. (d) Video S4. (e) Video S5. (f) Video S6. (g) Video S7. (h) Video S8. (i) Video S9. (j) Video S10.
Figure 4.4: Frame examples of multiple people walking in the laboratory. (a) Video M1. (b) Video M2. (c) Video M3. (d) Video M4. (e) Video M5. (f) Video M6. (g) Video M7.
Figure 4.5: Frame examples of a single person walking in the outdoor environment. (a) Video O1. (b) Video O2. (c) Video O3. (d) Video O4. (e) Video O5. (f) Video O6. (g) Video O7. (h) Video O8.
4.3 Experiment 1 – Face Detection using Image Rotation
In this section, we present a simple method that uses only image rotation to improve face detection with a surveillance camera. Faces that are not directly in front of the camera look like faces with in-plane rotations, so an intuitive approach is to treat them as in-plane rotated faces, as in [14, 15]. In [14, 15], faces rotated by different angles are detected by rotated detectors. For easier implementation, we instead rotate the image and repeat the detection procedure several times.
To decide the angular step between successive face detection passes, we first investigate the tolerance to in-plane rotation of the frontal face detector proposed in [17] and provided in OpenCV. A subset of the MUCT face database [18] is used in this test. More specifically, we use the images captured by camera "e", as shown in Figure 4.6.
There are 751 such images in total, and each subject is captured under two to three different lighting conditions, as shown in Figure 4.7. In our test, each image is rotated 360 times in steps of one degree, and face detection is carried out on all rotated versions. The detection result for one image and all its rotated versions is shown in Figure 4.8, where the maximum continuous range of rotation angles over which face detection succeeds spans 48 degrees. Applying the same procedure to all images, we find that the smallest maximum continuous range is 28 degrees and the average is 47.47 degrees.
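The "maximum continuous region" used in this test can be computed as the longest run of successful detections over the 360 rotation angles, treating the angles as circular so that a run may pass through 0 degrees. The sketch below is one way to do this; the function name is hypothetical and the flags in the usage lines are synthetic, chosen to resemble the case of detections from -25 to +22 degrees.

```python
def longest_detected_run(detected):
    """Length of the longest run of consecutive True values in a circular
    sequence, e.g. one detection flag per 1-degree rotation step."""
    n = len(detected)
    if n == 0:
        return 0
    if all(detected):
        return n
    best = run = 0
    # Scan the sequence twice so that a run wrapping past the end is counted.
    for i in range(2 * n):
        if detected[i % n]:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


# Synthetic flags for rotations of 0..359 degrees: the face is detected
# from -25 degrees (index 335) to +22 degrees (index 22).
flags = [(a <= 22 or a >= 335) for a in range(360)]
print(longest_detected_run(flags))
```

With these synthetic flags the run covers 23 non-negative and 25 negative angles, that is, 48 degrees in total.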
Figure 4.6: The relationship between five cameras and human face in MUCT.
Figure 4.7: Three different lighting conditions for a person in the MUCT face database, captured by camera e.
Figure 4.8: The test result for Figure 4.7(a). 1 = face detected; 0 = no face detected. The maximum continuous in-plane rotation range for Figure 4.7(a) is 22-(-25)+1 = 48 degrees.
Although the face detector proves effective for faces rotated by up to about ±23.7 degrees (half of the average range of 47.47 degrees), we take a relatively conservative rotation step of 15 degrees in our experiments. The maximum in-plane rotation of faces in our videos is about +24.8 degrees, so rotating the images by ±15 degrees should be enough to handle all in-plane rotations that occur here. Nevertheless, we also rotate the images by ±30 degrees and merge the results for greater robustness. An example of face detection on an image rotated by different angles is shown in Figure 4.9. The face detector detects only one face in the original image, while the remaining two faces are detected in the versions rotated by -15 and -30 degrees, respectively. Clearly, more faces can be found if more rotated versions of an image are provided.
Figure 4.9: Example of image rotation. (a) Original image. (b) Rotated by -15 degrees. (c) Rotated by 15 degrees. (d) Rotated by -30 degrees. (e) Rotated by 30 degrees.
By combining the results for the different rotations, we obtain the result shown in Figure 4.10. One can see that although more faces are detected, the number of false positives also increases.
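Combining detections from several rotated versions requires mapping each detection back to the coordinates of the original image. The sketch below illustrates one possible way to do this for face centers; it is not the merging procedure used in the thesis. It assumes that the image was rotated about `center` by `angle_deg` in the same rotational sense as the formula below, and the duplicate test based on `min_dist` is a hypothetical choice, as are the function names.

```python
import math


def rotate_back(point, angle_deg, center):
    """Map a point found in an image rotated by angle_deg about center
    back to the coordinates of the unrotated image."""
    a = math.radians(-angle_deg)  # undo the rotation
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (center[0] + dx * math.cos(a) - dy * math.sin(a),
            center[1] + dx * math.sin(a) + dy * math.cos(a))


def merge_detections(detections, center, min_dist):
    """detections: iterable of (angle_deg, (x, y)) face centers.

    Returns face centers in original coordinates, dropping any center
    closer than min_dist to one that was already kept.
    """
    kept = []
    for angle, pt in detections:
        p = rotate_back(pt, angle, center)
        if all(math.hypot(p[0] - q[0], p[1] - q[1]) >= min_dist for q in kept):
            kept.append(p)
    return kept


# The second detection is the first one seen in a 90-degree rotated image.
found = [(0.0, (10.0, 10.0)), (90.0, (-10.0, 10.0)), (0.0, (100.0, 100.0))]
print(merge_detections(found, (0.0, 0.0), 5.0))
```

Because every rotated version can contribute its own false positives, merging in this way keeps more true faces but also accumulates false alarms, in line with the observation above.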
Figure 4.10: Detection results of Figure 4.9 combined into one image. (a) Combining Figure 4.9(a)-(c), with 2 TP/FP in the image. (b) Combining Figure 4.9(a)-(e), with 3 TP/FP in the image.
Finally, we show the detection results in Tables 4.1 to 4.3, where O represents the original method without any preprocessing of the images; R1 represents combining the results on the original images and on the 15-degree rotated versions; and R2 represents combining the results on the original images, the 15-degree rotated versions, and the 30-degree rotated versions. In the tables, we give precision and recall to show the detection performance. One can see that the more rotated images are added, the higher the recall and the lower the precision. In Table 4.4, we give the average results across all videos. It is clear that R2 has the best recall but the lowest precision in most groups.
Table 4.1: Face detection results of S1 ~ S10 videos by image rotation.
S1 S2 S3 S4
Recall Precision Recall Precision Recall Precision Recall Precision
O 100%
Recall Precision Recall Precision Recall Precision Recall Precision
O 1.8%
Recall Precision Recall Precision
O 95.5%
Table 4.2: Face detection results of M1 ~ M7 videos by image rotation.
M1 M2 M3 M4
Recall Precision Recall Precision Recall Precision Recall Precision
O 42.2%
Recall Precision Recall Precision Recall Precision
Recall Precision Recall Precision Recall Precision