3.2 Review of Perspective Projection
Perspective projection is the image-formation model of the classical pinhole camera.
A pinhole camera coordinate system is shown in Figure 3.6. The origin O is the pinhole, and the vector plane formed by the vectors x and y is parallel to the image plane Π'. The optical axis is the line through O perpendicular to Π'; the point C' where it pierces Π' is called the image center, and the distance between C' and O is f'.
Let P denote a scene point with coordinates (x, y, z) and P' denote its image with coordinates (x', y'). Since the three points P, O, and P' are collinear, we get
$$\frac{x'}{x} = \frac{y'}{y} = \frac{f'}{z},$$
and so, by the triangulation principle, we have
$$x' = f'\,\frac{x}{z}, \qquad y' = f'\,\frac{y}{z}. \tag{3.1}$$
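As shown in Figure 3.7, consider the fronto-parallel plane Π0 defined by z = z0. For any point P in Π0 we can rewrite the perspective projection Equation (3.1) as
$$x' = \frac{f'}{z_0}\,x, \qquad y' = \frac{f'}{z_0}\,y.$$
When an image point P' is mapped back to the world coordinate system, however, it is impossible to determine the distance between the eyepoint and the point P, due to the inherent ambiguity of the light ray projection: every point on the ray through P' projects to the same image point.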
However, the point P', the projection of the point P, can be represented by the longitude and the latitude of the point P in the real world. For example, let the longitude and the latitude of the point P be θ and φ, respectively. Then the point P' can likewise be represented by the longitude and latitude pair (θ, φ), as shown in Figure 3.7.
Figure 3.6 Concept of perspective projection.
Figure 3.7 Image coordinate system mapping into real world.
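The projection of Equation (3.1) and the depth ambiguity discussed above can be illustrated with a short sketch. The function names and the angle convention (longitude measured from the optical axis z about the vertical axis, latitude measured as elevation from the x-z plane) are assumptions made for illustration only; the precise definitions of longitude and latitude used in this study may differ.

```python
import math

def project(point, f):
    """Pinhole projection: x' = f * x / z, y' = f * y / z."""
    x, y, z = point
    if z == 0:
        raise ValueError("point lies in the plane of the pinhole (z = 0)")
    return (f * x / z, f * y / z)

def viewing_angles(point):
    """Longitude and latitude (radians) of the ray from O through the point.

    Assumed convention: longitude is measured from the optical axis z,
    latitude is the elevation of the ray above the x-z plane.
    """
    x, y, z = point
    theta = math.atan2(x, z)
    phi = math.atan2(y, math.hypot(x, z))
    return (theta, phi)

near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # same light ray as `near`, twice as far from O

# Both points share one image point and one (longitude, latitude) pair,
# so the image alone cannot tell them apart.
print(project(near, 2.0), project(far, 2.0))
print(viewing_angles(near), viewing_angles(far))
```

The two scene points lie on one light ray, so they share both the image point and the angle pair, which is why the angles, but not the distance, can be recovered from a single image.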
3.3 Calibration of Camera by Angular Mapping
Each pixel in the image can be represented by a longitude value and a latitude value. With no camera distortion, each pair of longitude and latitude values is linearly mapped to a pixel in the image. That is, given a pixel P(u, v) in the image coordinate system, the linear mapping yields the longitude of P from u and the latitude of P from v independently. Unfortunately, the assumption of no camera distortion is an idealization, and the actual coordinate transformation from the real world to the 2D image is nonlinear. We therefore have to consider the u-coordinate and the v-coordinate at the same time to get the longitude and latitude values of a pixel in the image. In Section 3.3.1 we introduce the concept of a linear angular transformation, which is the basic idea behind the proposed calibration method. To correct the effect of the distortion, we propose a nonlinear angular mapping method in Section 3.3.2.
3.3.1 Linear Angular Transformation
First of all, assume that the swing angle of the camera is zero and that the pan and tilt angles, α and β, of the camera are known. The horizontal and vertical resolutions, W and H, of the image are also known, as shown in Figure 3.8. We define the image coordinate system as shown in Figure 3.9. Assume also that the center of the imaging screen and the center of the lens coincide, i.e., the image coordinates (0, 0) specify the center of the image. Using this information, we can describe each pixel in the image coordinate system by its longitude and latitude values. We use the vector (θ, φ) to represent the longitude and latitude pair corresponding to the pixel P(u, v) in the image. The detailed process of the angular coordinate transformation is described in the following algorithm.
Algorithm 3.1 Angular coordinate transformation by trigonometric function.
Input: An image I with resolution W×H, and the horizontal and vertical viewing angles α and β, respectively, of the imaging camera.
Output: A longitude set { , ,..., , , ,..., , }
Figure 3.8 The relationship of image resolution and the viewing angles. (a) The horizontal direction. (b) The vertical direction.
Figure 3.9 An illustration of the image coordinate system confined by the ranges of image resolution. (Axes u and v; origin (0, 0) at the image center; labeled boundary points (0, W/2), (0, -W/2), (H/2, 0), and (-H/2, 0).)
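The steps of Algorithm 3.1 are not reproduced in this text. As a rough sketch only, a distortion-free trigonometric mapping from a pixel to its angles could look as follows. It assumes that α and β are the full horizontal and vertical viewing angles, that u is the horizontal offset in [-W/2, W/2] and v the vertical offset in [-H/2, H/2] from the image center, and that the two axes are treated independently; the function name is hypothetical.

```python
import math

def pixel_to_angles(u, v, width, height, alpha_deg, beta_deg):
    """Map centered pixel coordinates (u, v) to (theta, phi) in degrees.

    Assumes an ideal pinhole camera without distortion, with alpha_deg and
    beta_deg the full horizontal and vertical viewing angles.
    """
    # Focal lengths in pixels implied by the viewing angles:
    # tan(alpha / 2) = (W / 2) / f_u, and likewise for the vertical axis.
    f_u = (width / 2.0) / math.tan(math.radians(alpha_deg) / 2.0)
    f_v = (height / 2.0) / math.tan(math.radians(beta_deg) / 2.0)
    theta = math.degrees(math.atan(u / f_u))
    phi = math.degrees(math.atan(v / f_v))
    return theta, phi

# The image border maps to half of the viewing angle on each axis.
print(pixel_to_angles(320, 240, 640, 480, 60.0, 45.0))
```

A purely linear variant would instead use θ = u·α/W and φ = v·β/H; the two agree at the image center and at the border and differ slightly in between.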
3.3.2 Nonlinear Angular Mapping
In the presence of camera distortion, the linear angular transformation described in Section 3.3.1 cannot yield the correct angles θi and ϕi for points in the image coordinate system. To obtain the angular transformation from the real world to the image precisely, we propose a real-world data acquisition method based on angular-mapping camera calibration. Since camera distortion exists both horizontally and vertically, we have to consider the horizontal and vertical directions at the same time when computing the longitude and latitude values of each point in the image.
In the proposed method, we attach a grid with m vertical lines and n horizontal lines to a wall which is perpendicular to the ground. We then have a real-world point set V = {V00, V01, …, Vmn}, where Vij = (θij, φij) is the pair of longitude and latitude values in the SCS of the point Vij at the intersection of the ith vertical line and the jth horizontal line. The set V of intersection points is known in advance. The corresponding point set P = {P00, P01, …, Pmn} appearing in the image may be identified manually, where Pij = (uij, vij) is the point in the ICS corresponding to the point Vij. The detailed process of this nonlinear angular mapping is described in the following algorithm.
Algorithm 3.2 The real location data acquisition by image taking and mapping.
Input: An image I, as shown in Figure 3.11, and a set of longitude and latitude pairs V = {V00, V01, …, Vmn}, as mentioned above.
Output: A point set P = {P00, P01, …, Pmn} in I corresponding to V, with Pij corresponding to Vij, where i = 0, 1, …, m and j = 0, 1, …, n.
Steps:
Step 1. Attach a grid with m vertical lines and n horizontal lines on a wall, which is perpendicular to the ground.
Step 2. According to the interval distance of the grid on the wall and the distance Dic from the wall to the camera, measure the longitude and latitude values of each point Vij in the set V.
Step 3. Fix the interval of the longitude and the latitude at 5° by adjusting the spacing between the vertical lines and between the horizontal lines of the grid on the wall, under the constraint Dic = 170 cm, as shown in Figure 3.10.
Step 4. Mark yellow points at the intersections of the lines, as shown in Figure 3.11.
Step 5. Record the coordinates of each yellow point Pij(uij, vij) in the ICS and group all such points as a set P.
Step 6. For each point Pij in P in the image, manually identify the corresponding point Vij in V with the longitude and latitude values θij and φij, as shown in Figure 3.12 and set up the mapping.
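Steps 2 and 3 relate positions on the wall to viewing angles. The sketch below shows one way this relation could be computed, assuming the optical axis is perpendicular to the wall and meets it at the grid origin, and assuming the same longitude and latitude convention as a spherical coordinate system centered at the camera. The function names are hypothetical, and the actual measurement procedure used in this study may differ.

```python
import math

def line_offsets_cm(distance_cm, step_deg, max_deg):
    """Offsets x = D * tan(theta) of grid lines spaced step_deg apart in angle."""
    count = int(max_deg // step_deg)
    return [distance_cm * math.tan(math.radians(k * step_deg))
            for k in range(-count, count + 1)]

def wall_point_angles(x_cm, y_cm, distance_cm):
    """(longitude, latitude) in degrees of a wall point seen from the camera.

    x_cm and y_cm are the horizontal and vertical offsets of the point from
    the spot where the optical axis meets the wall (an assumed origin).
    """
    theta = math.degrees(math.atan2(x_cm, distance_cm))
    phi = math.degrees(math.atan2(y_cm, math.hypot(x_cm, distance_cm)))
    return theta, phi

# Vertical-line positions for a 5 degree interval at Dic = 170 cm.
offsets = line_offsets_cm(170.0, 5.0, 45.0)
print([round(x, 1) for x in offsets])
```

Under these assumptions the lines are not equally spaced on the wall: the spacing grows toward the edges because the offset is proportional to the tangent of the angle.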
We now know the longitude and latitude values of the yellow points in the image. To obtain the longitude and latitude values of the other pixels in the image, we use an interpolation method, as described in the following algorithm.
Algorithm 3.3 Interpolation for computing viewing angles of any point.
Input: An image point I(u, v) in the ICS, the point set P and the point set V
Step 2. Decide whether the point I is in the region bounded by the four corner points Pij, P(i+1)j, Pi(j+1), and P(i+1)(j+1) by substituting its coordinates (u, v) into the inequalities (3.1) and (3.2). If both inequalities are satisfied, the point I is regarded as lying in the region; otherwise, repeat Step 2 to check the next region.
Step 3. Define a line Mh which passes through the point I and whose slope is the average of the slopes of L1 and L3, and obtain the two intersections q(uq, vq) and r(ur, vr) of Mh with L0 and L2, as shown in Figure 3.13.
Step 4. Define a line Mv which passes through the point I and whose slope is the average of the slopes of L0 and L2, and obtain the two intersections s(us, vs) and t(ut, vt) of Mv with L1 and L3, as shown in Figure 3.13.
Step 5. Use an interpolation method to obtain the longitude and the latitude (θI, φI) of I in the SCS by the following equations according to the geometric ratio principle:
Figure 3.10 An illustration of attaching the lines on the wall.
By the interpolation method, each pixel in the image coordinate system can be mapped to longitude and latitude values in the SCS. With this information, we can get the angular position of objects in the image, as described in Section 3.4.
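The interpolation equations of Step 5 in Algorithm 3.3 are not reproduced in this text. A minimal sketch of a geometric-ratio interpolation, assuming the angle varies linearly with distance along each of the lines Mh and Mv, is given below. Here q and r are the intersections lying on the two grid lines of known longitude θq and θr, and s and t are the intersections lying on the two grid lines of known latitude φs and φt; the function names and argument order are hypothetical.

```python
import math

def ratio_along(a, b, p):
    """Fraction of the way from a to b at which p lies (p assumed on segment ab)."""
    length = math.dist(a, b)
    if length == 0:
        raise ValueError("degenerate segment")
    return math.dist(a, p) / length

def interpolate_angles(p, q, r, theta_q, theta_r, s, t, phi_s, phi_t):
    """Estimate (theta, phi) of image point p by linear ratios along Mh and Mv."""
    theta = theta_q + (theta_r - theta_q) * ratio_along(q, r, p)
    phi = phi_s + (phi_t - phi_s) * ratio_along(s, t, p)
    return theta, phi

# A point halfway between grid lines at 10 and 15 degrees of longitude
# and halfway between grid lines at 0 and 5 degrees of latitude.
print(interpolate_angles((5, 5), (0, 5), (10, 5), 10.0, 15.0,
                         (5, 0), (5, 10), 0.0, 5.0))
```

With the 5° grid described above, the error of such a linear approximation is confined to a single grid cell, which is the motivation for interpolating cell by cell rather than over the whole image.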
Figure 3.11 A method of finding image coordinates of tessellated points in the grabbed image. (a) A grabbed image with tessellated points. (b) The tessellated points marked by yellow points.
Figure 3.12 The points on the wall corresponding to the yellow points in Figure 3.11(b).
Figure 3.13 An illustration of the interpolation method for a region that contains the point I in the ICS.
3.4 Vehicle Location Techniques Using Angular Mapping
Using the calibration by the nonlinear angular mapping method described in Section 3.3.2, we can get the longitude and latitude values of each pixel in the image. Since the camera is mounted on the arm of the vehicle, the directional angles of the camera are not always zero. When the pan angle of the camera is not zero, the direction of an object with respect to the camera differs from its direction with respect to the vehicle. To track the target object correctly, we have to transform the directional angles with respect to the camera into ones with respect to the vehicle. For this transformation, the longitude and latitude values obtained by mapping the image coordinates are not enough; we need more data to resolve the ambiguity in distance estimation.
First, we have to know how to calculate the distance between the object and the camera from the latitude value and the height of the camera above the ground; we discuss our method for this purpose in Section 3.4.1. The way we propose for computing the directional angles of the camera with respect to the vehicle is stated in Section 3.4.2.
3.4.1 2D to 3D Distance Transformation
As shown in Figure 3.14, we know the coordinates of an object in the ICS after we take an image of the object. After the angular mapping, we have the longitude and the latitude of each image point of the object in the SCS. Since we have the latitude value of the object and know the height of the camera, if we know the ground contact point of the object in the image, we can compute the distance between the camera and the object. The algorithm is described in the following.
Algorithm 3.4 Transformation of 2D image point to 3D real world point.
Input: Camera height Hc, and the ground contact point P(u, v) in the ICS of an object.
Output: The distance Doc between the object and the camera.
Steps:
Step 1. Transform (u, v) of P into its (θ, φ), the longitude and the latitude in the SCS, by the proposed mapping method described in Section 3.3.2.
Step 2. Compute Doc by the following equation:
OC
Since the above algorithm gives the distance between the vehicle and the object, what remains is to turn the vehicle toward the object and move it forward. The way we propose to find the angle through which the vehicle has to turn is introduced in the next section.
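The equation of Step 2 in Algorithm 3.4 is not legible in this text. A minimal sketch of the underlying right-triangle relation is given below, assuming flat ground, a distance measured along the ground, and a sign convention in which angles below the horizontal are negative. Adding the tilt angle φc directly to the latitude φ is itself a simplifying assumption that is exact only for points at zero longitude; the function name is hypothetical.

```python
import math

def ground_distance(camera_height, phi_deg, tilt_deg=0.0):
    """Horizontal distance from the camera to a ground contact point.

    phi_deg:  latitude of the contact point relative to the optical axis,
              negative when the point is below the axis (assumed convention).
    tilt_deg: tilt of the camera, negative when tilted downward.
    """
    depression = -(phi_deg + tilt_deg)  # angle of the ray below the horizontal
    if depression <= 0:
        raise ValueError("the viewing ray does not hit the ground")
    return camera_height / math.tan(math.radians(depression))

# A camera 100 cm above the ground sees a contact point 45 degrees below
# the horizontal, either without tilt or with a 30 degree downward tilt.
print(ground_distance(100.0, -45.0))
print(ground_distance(100.0, -15.0, tilt_deg=-30.0))
```

The two calls correspond to cases (a) and (b) of Figure 3.14: the same ray below the horizontal gives the same distance whether the angle comes entirely from the latitude or partly from the camera tilt.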
Figure 3.14 The distance between the object and the vehicle. (a) The camera has no tilting, i.e. φc = 0. (b) The camera has a tilt angle of φc.
3.4.2 Angle Transformation between Coordinate Systems
As shown in Figure 3.15, we can locate the object in the image and get the distance from the vehicle to the object according to the above algorithms. However, the rotation center of the camera is different from that of the vehicle. The longitude obtained in the SCS therefore has to be transformed into the directional angle θv, the angle that the vehicle has to turn to aim at the object, in the PCS. The transformation from (u, v) in the ICS to θv in the PCS is described in the following algorithm.
Algorithm 3.5 The angular transformation from the ICS to the PCS.
Input: The distance between the rotation centers of the camera and the vehicle, and the ground contact point (u, v) of the object in the ICS.
Output: The directional angle θv of the object in the PCS.
Steps:
Step 1. Transform (u, v) to (θ, φ), the longitude and latitude in the SCS, by the proposed mapping method in Section 3.3.2.
Step 2. Compute θv in the PCS by the following equation according to Figure 3.15:
cv
Figure 3.15 The rotation angle of the vehicle to obtain the tracking of the object.
From the above algorithm and the one in the last section, we can transform the position of the object in the image into the polar coordinate system with the distance and directional angle (Dic, θv). By this transformation, the location of an object in the real world becomes clear and can help us to conduct tracking of target objects.
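The equation of Step 2 in Algorithm 3.5 is not legible in this text. As a sketch of the kind of geometry involved, the code below assumes that the rotation center of the camera lies a distance d_cv ahead of the rotation center of the vehicle along the forward axis of the vehicle, that angles are positive to one fixed side, and that d_oc is the ground distance from the camera to the object. All names are hypothetical, and the actual layout in Figure 3.15 may differ.

```python
import math

def camera_to_vehicle(d_oc, theta_deg, d_cv, pan_deg=0.0):
    """Convert a camera-centered bearing into a vehicle-centered polar position.

    d_oc:      ground distance from the camera rotation center to the object
    theta_deg: longitude of the object relative to the optical axis
    d_cv:      offset of the camera rotation center ahead of the vehicle's
    pan_deg:   pan angle of the camera relative to the vehicle heading
    Returns (distance from the vehicle center, turning angle in degrees).
    """
    bearing = math.radians(theta_deg + pan_deg)
    forward = d_cv + d_oc * math.cos(bearing)  # along the vehicle heading
    lateral = d_oc * math.sin(bearing)         # perpendicular to the heading
    theta_v = math.degrees(math.atan2(lateral, forward))
    return math.hypot(forward, lateral), theta_v

# An object straight ahead of the camera stays straight ahead of the vehicle.
print(camera_to_vehicle(100.0, 0.0, 20.0))
# An object at 90 degrees to the camera is seen at a smaller angle by the vehicle.
print(camera_to_vehicle(100.0, 90.0, 100.0))
```

The second call shows why the transformation is needed: because the two rotation centers do not coincide, the turning angle of the vehicle is smaller than the bearing measured at the camera.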
Chapter 4
Human Detection by Image Analysis for Indoor Security Patrolling
4.1 Overview of Human Detection
There are many kinds of features and sensors for detecting human beings. Since visual perception is the only sensing capability of the proposed system in this study, we detect human beings by image analysis. The face is an obvious characteristic of human beings. As a result, we propose a method to detect human faces by color and shape features in images. The method for face detection is described in Section 4.3. Sometimes the limited resolution of the camera makes the acquired image unclear, and a large distance between a person and the camera may make it difficult to segment a clear face region out of the image of the person. To overcome this limitation, we propose a blockwise frame difference method to extract moving objects in the image and decide whether a moving object resembles a human body. The motion detection method is presented in Section 4.4. Before describing the details of these techniques, we give a brief introduction to the proposed process in Section 4.2.
4.2 Proposed Process of Human Detection
The proposed process of human detection has two major parts: human face detection and human body detection. The features we adopt to detect a human face are color and shape. The color of a face is simply skin color, which has been studied intensively in recent years. In this study, we adopt an elliptic skin model to determine whether the color of a pixel is skin color or not.
After getting all the skin color regions in an image, we have to recognize which one is similar to the shape of a human face. As the contour of a human face is roughly elliptic in shape, we propose a method for matching each skin color region with an elliptic shape mask. On the other hand, to avoid erroneously recognizing an elliptic non-face region as a face from skin color regions, we make a double check by motion detection.
If nothing is detected by the face detection process, we assume that a person might still exist at a far distance. We then try to confirm this by detecting the existence of a human body using moving regions in the image, which can be obtained by an additional process of frame differencing. Conventional frame differencing does not work when a moving region appears against a changing background. We therefore propose a blockwise frame differencing technique to detect moving regions. After performing this technique, we can get the moving regions in an image and detect any human body by applying a shape recognition technique to these regions.
The system will stop the human detection process and start a human tracking process as soon as a face is detected in an image. The process of human tracking will be described in the next chapter. The major steps of the proposed process of human detection are presented as follows.
Step 1. Capture an image.
Step 2. Apply region segmentation by skin color identification and motion detection by blockwise frame differencing to extract motion regions.
Step 3. Fit each extracted skin region with an ellipse to detect a possible human face.
Step 4. Apply human body detection by applying shape recognition to extracted motion regions.
The proposed process of human detection is illustrated in Figure 4.1.
Figure 4.1 The proposed process of human detection.
4.3 Human Detection by Faces
In order to detect faces in images, we have to choose features to define a face. The features we use in this study are color and shape, as mentioned previously. The rough outline of a face can be represented by an ellipse with the skin color. Thus, we detect a human face in the image by searching for a skin-colored ellipse.
More specifically, recognizing a skin-colored ellipse in an image involves two tasks: defining skin color and conducting pattern recognition of ellipses. With a captured image, we first segment out any skin regions and then fit shapes to the regions. If a skin color region is close to an ellipse in shape, it is decided that a face is detected. In Section 4.3.1, we introduce the proposed method for classification of skin color. In Section 4.3.2, we describe the proposed method for ellipse shape recognition.
4.3.1 Skin Region Segmentation Using Color Classification
Skin region segmentation is commonly used for face detection. The goal is to determine whether a pixel is of a skin color or not. Before defining the classifier for skin color, it is important to choose a color representation, because this choice affects the complexity of the classifier.
In this section, we describe the color space and the classification algorithm proposed in this study.
4.3.1.1 YCbCr Color Space
Many common color models are used in the field of computer vision, for example, RGB, HSI, HSV, YCbCr, CMY, CIE, etc. Each of the color models has its own characteristics and is applicable to a specific set of applications. For the application of skin region segmentation, classifiers for different color models have been proposed in many research works.
In this study, we choose YCbCr as the color space for detecting skin color in images. According to Chai and Bouzerdoum [20], the distribution of skin color in the YCbCr color space is concentrated, and the distributions of the skin colors of different human races are similar. As a result, transforming images in the RGB color space into ones in the YCbCr color space can reduce the complexity of skin-color pixel classification.
In the YCbCr color space, Y represents the luminance, Cb represents the chrominance of blueness, and Cr represents the chrominance of redness. Y is coded from 16 to 235, where code 16 is black and 235 is white. And Cb and Cr range from 16 to 240. RGB values can be transformed into the YCbCr color space by (4.1) below:
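$$\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} = \begin{bmatrix} 16 \\ 128 \\ 128 \end{bmatrix} + \begin{bmatrix} 65.481 & 128.553 & 24.966 \\ -37.797 & -74.203 & 112 \\ 112 & -93.786 & -18.214 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \tag{4.1}$$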
Given the input RGB values which are within the range of [0, 1], the output values will be within the range of [16, 235] for Y and [16,240] for Cb and Cr.
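The transformation can be checked numerically with a short sketch. The coefficients below are the standard ITU-R BT.601 values, which are assumed to be the ones intended by Equation (4.1); the function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert RGB components in [0, 1] to (Y, Cb, Cr) in the 8-bit coding range.

    Uses the ITU-R BT.601 coefficients (assumed to match Equation (4.1)).
    """
    for component in (r, g, b):
        if not 0.0 <= component <= 1.0:
            raise ValueError("RGB components must lie in [0, 1]")
    y = 16.0 + 65.481 * r + 128.553 * g + 24.966 * b
    cb = 128.0 - 37.797 * r - 74.203 * g + 112.0 * b
    cr = 128.0 + 112.0 * r - 93.786 * g - 18.214 * b
    return y, cb, cr

# Black and white sit at the two ends of the Y range with neutral chrominance;
# pure blue and pure red reach the upper ends of the Cb and Cr ranges.
for name, rgb in [("black", (0, 0, 0)), ("white", (1, 1, 1)),
                  ("blue", (0, 0, 1)), ("red", (1, 0, 0))]:
    print(name, [round(c, 3) for c in rgb_to_ycbcr(*rgb)])
```

Each chrominance row of the matrix sums to zero, so every gray value (R = G = B) maps to Cb = Cr = 128, and the luminance row sums to 219, which is the width of the Y range from 16 to 235.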
Figure 4.2 shows the YCbCr color model with the value of Y being 126. Because the linear transformation maps the RGB cube onto only part of the YCbCr coding range, some pairs of Cb and Cr values do not correspond to any valid RGB color when the value of Y is 126. Some YCbCr color models with different Y values are shown in Figure 4.3. When Y is 16, which is the darkest luminance, Cb and Cr can only have the value of 128. Likewise, when Y is 235, which is the brightest luminance, Cb and Cr can also only have the value of 128. Figure 4.4 shows a 3D YCbCr color model [23][24].
Figure 4.2 YCbCr color model with Y = 126. (Axes Cb and Cr, each ranging from 16 to 240; labeled points (65, 48) and (190, 207).)
Figure 4.3 YCbCr color models with different Y values.
Figure 4.4 3D YCbCr color model in [23][24].
4.3.1.2 Adopted Skin Color Model
Previous studies have found that pixels belonging to skin regions exhibit similar Cb and Cr values [20][21]. Chai and Ngan [21] used a fixed-range skin color map in the Cb-Cr plane for face segmentation, in which Cb ranges from 77 to 127 and Cr ranges from 133 to 173, so that the skin color region is a rectangle. However, when we observe the distribution of skin color in the Cb-Cr plane, it is found to be more similar to an oblique ellipse, as shown in Figure 4.5. In the study by Lee and Yoo [22], a new statistical color model for skin detection with