Chapter 3. Camera Coordination
3.1. Problem Formulation
We first define the problem to be solved. Unlike the works introduced in Section 2.1, we aim to capture as many frontal, high-resolution facial images as possible while the monitored targets are present in the scene. In the proposed
algorithm, a PTZ camera is allowed to cover more than one target at a time, as long as the captured facial images are sufficiently clear. Moreover, the tracking of a target can be handed over from one PTZ camera to another so that the face of that target can be better observed over time. We design the camera coordination system around two major criteria: frontal shoot and high-resolution shoot.
To formulate these two criteria, we define the shoot angle θij and the face width Wij. In θij and Wij, the subscript i denotes the i-th PTZ camera, while the subscript j denotes the j-th target. As shown in Figure 3-2, the shoot angle θij is the angle between the blue arrow camij and the green arrow facej. Here camij indicates the line connecting the i-th PTZ camera and the j-th target, while facej indicates the facing orientation of the j-th target. When the j-th target looks toward the i-th camera, the shoot angle is small. On the other hand, as shown in Figure 3-3, the face width Wij is the width of the j-th target’s face in the image captured by the i-th camera. A larger value of Wij indicates a better observation of the j-th target in the i-th camera image.
Figure 3-2 The illustration of θij
Figure 3-3 The illustration of Wij
To simplify the computation of θij and Wij, all 3-D vectors are projected onto the ground plane, and the resulting 2-D vectors are used in place of the original 3-D vectors. In these simplified forms, the shoot angle and the face width are defined as follows.
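$$\theta_{ij} = \cos^{-1}\!\left(\frac{\mathbf{cam}_{ij}\cdot\mathbf{face}_{j}}{\lVert\mathbf{cam}_{ij}\rVert\,\lVert\mathbf{face}_{j}\rVert}\right) \tag{Eq. 3-1}$$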
In the definition of Wij, fxi denotes the focal length of the i-th PTZ camera in the horizontal direction, Dij is the distance between the i-th camera and the j-th target, and FOVi is the field of view of the i-th PTZ camera. Eq. 3-2 and Eq. 3-3 originate in the pinhole camera model. Strictly speaking, Eq. 3-2 holds only when the face is at the center of the image. However, we do not need a very precise face width in the image, so we adopt this approximation to simplify the computation.
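As an illustration of these two quantities, the following Python sketch computes a shoot angle and an approximate face width on the ground plane. It is only a sketch under stated assumptions, not the exact form of Eq. 3-1 to Eq. 3-3: we assume that camij points from the target toward the camera (so that a target facing the camera gives a small angle), and the function names, the physical face width `real_face_width` and the image width `image_width_px` are hypothetical parameters that do not appear in the text above.

```python
import math


def shoot_angle(cam_pos, target_pos, face_dir):
    """Angle in radians between the target-to-camera vector and the facing
    vector, both taken as 2-D vectors on the ground plane."""
    # Assumption: cam_ij is directed from the target toward the camera.
    cam_vec = (cam_pos[0] - target_pos[0], cam_pos[1] - target_pos[1])
    dot = cam_vec[0] * face_dir[0] + cam_vec[1] * face_dir[1]
    norm = math.hypot(*cam_vec) * math.hypot(*face_dir)
    if norm == 0.0:
        raise ValueError("camera and target coincide, or facing vector is zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def focal_length_px(image_width_px, fov_rad):
    """Horizontal focal length in pixels from the horizontal field of view
    (pinhole camera model)."""
    return (image_width_px / 2.0) / math.tan(fov_rad / 2.0)


def face_width_px(fx, distance, real_face_width=0.16):
    """Approximate face width in pixels (pinhole projection).

    Exact only for a face at the image center; real_face_width (meters)
    is a hypothetical nominal value.
    """
    return fx * real_face_width / distance
```

For instance, a camera at (0, 0) and a target at (4, 0) facing the direction (-1, 0) give a shoot angle of 0. With a 640-pixel-wide image and a 90-degree field of view, the focal length is about 320 pixels, so a 0.16 m wide face at a distance of 4 m spans about 12.8 pixels.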
The shoot angle and the face width are two different physical quantities, and their desired tendencies are opposite: we prefer to capture a facial image with a smaller shoot angle but a larger face width.
Therefore, we apply two mapping functions Nθ( ) and NW( ) to θij and Wij to convert them into two normalized measures, so that the two quantities can be treated in a unified way. We define the normalized values to become larger as the capture situation becomes better. In other words, we assign higher “scores” to better capture situations. For example, we want θij to be as small as possible so that the face appears more frontal; therefore, Nθ(x) becomes larger as x becomes smaller. These two mapping functions are defined as follows and are illustrated in Figure 3-4 and Figure 3-5.
Nθ(x) = …    Eq. 3-4
NW(x) = …    Eq. 3-5
In Eq. 3-4 and Eq. 3-5, k is a positive constant that controls the dynamic range of Nθ( ) and NW( ). rθ and rW are real numbers within the range [0, 1] that control the slopes of Nθ( ) and NW( ). thθ, thmin, and thmax are pre-defined thresholds. thθ represents the worst situation that can be allowed for capturing the frontal face, and thmin represents the minimum face width for clear observation. On the other hand, when the face width is wider than thmax, we regard the facial image as having reached the level of perfect observation. These thresholds can be varied by the users for different applications.
Figure 3-4 Normalized function of the bias angle
Figure 3-5 Normalized function of the face width
Physically, thθ marks the worst situation allowed for capturing the frontal face: we can hardly see (or identify) someone’s face when the angle between the face vector and the camera vector exceeds thθ. Similarly, thmin and thmax represent the worst and best cases of the face width in the image, respectively. When the face width is smaller than thmin, the face cannot be seen clearly because of the low resolution. Conversely, when the face width reaches or exceeds thmax, we can clearly identify the face. The parameters rθ and rW adjust the slopes of the linear parts of the normalized functions, and thereby their maximal and minimal values. They therefore affect the relative weighting of orientation and clearness. For example, if rW becomes smaller, the largest and smallest values of the normalized function move closer together, so the function discriminates less between face resolutions. In the extreme case rW = 0, the slope is zero and every face width receives the same normalized value, so the face width makes no difference after normalization.
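To make the roles of the thresholds and of rθ and rW concrete, the sketch below implements one piecewise-linear pair of mapping functions that is consistent with the description above. It is an assumed form and not necessarily the exact expression of Eq. 3-4 and Eq. 3-5; other parametrizations with the same qualitative behaviour are possible. The function names and all default parameter values (k, the slopes and the thresholds) are hypothetical.

```python
def n_theta(x, k=1.0, r_theta=0.8, th_theta=1.0):
    """Assumed normalized score of a shoot angle x (radians, x >= 0).

    Decreases linearly from k at x = 0 to k * (1 - r_theta) at x = th_theta,
    and is zero once the angle exceeds th_theta.
    """
    if x > th_theta:
        return 0.0
    return k * (1.0 - r_theta * x / th_theta)


def n_w(x, k=1.0, r_w=0.8, th_min=20.0, th_max=60.0):
    """Assumed normalized score of a face width x (pixels).

    Zero below th_min, linear between th_min and th_max, and saturated
    at k once the width reaches th_max.
    """
    if x < th_min:
        return 0.0
    if x >= th_max:
        return k
    return k * (1.0 - r_w * (th_max - x) / (th_max - th_min))


if __name__ == "__main__":
    # With r_w = 0 the slope vanishes: every width above th_min scores the same.
    for width in (25.0, 40.0, 55.0):
        print(width, n_w(width), n_w(width, r_w=0.0))
```

In this assumed form, the scores over the allowed range lie between k(1 - r) and k, so a smaller r narrows the gap between the best and the worst score, which matches the behaviour described for rW above.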
The goal of our system is to find a camera coordination that makes each θij as small as possible while making each Wij as large as possible. With the definitions of Nθ and NW, we define Eval( ) (Eq. 3-6), which scores the face capture of the j-th target by the i-th camera and is used to evaluate different coordinations. The larger the value of Eval( ) is, the better the coordination performs.
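$$\mathrm{Eval}(AP) = \sum_{i=1}^{m}\sum_{j=1}^{n} ap_{ij}\, N_{\theta}(\theta_{ij})\, N_{W}(W_{ij}) \tag{Eq. 3-6}$$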
In Eq. 3-6, m and n are the numbers of cameras and targets, respectively. AP denotes a set of camera assignments and is defined in Eq. 3-7:
$$AP = \{\, ap_{ij} \mid i = 1, 2, \ldots, m,\; j = 1, 2, \ldots, n \,\} \tag{Eq. 3-7}$$
Each apij is a binary assignment parameter: apij is equal to 1 if the i-th camera is assigned to monitor the j-th target, and 0 otherwise. Hence, for a camera assignment AP, Eval(AP) represents the overall observation level of the n targets by all m cameras. When more targets are better observed by their corresponding cameras, with smaller shoot angles and larger face widths, Eval(AP) is larger. The goal of the proposed camera coordination system is therefore to find the optimal camera assignment, that is, the one that reaches the largest Eval(AP).
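The evaluation of a given assignment can be sketched as follows. The code assumes that the normalized scores Nθ(θij) and NW(Wij) have already been computed and stored as m-by-n nested lists; the function name `evaluate` and the numeric scores in the example are hypothetical and serve only as an illustration.

```python
def evaluate(ap, n_theta_scores, n_w_scores):
    """Sum ap[i][j] * N_theta[i][j] * N_W[i][j] over cameras i and targets j."""
    total = 0.0
    for ap_row, theta_row, width_row in zip(ap, n_theta_scores, n_w_scores):
        for assigned, s_theta, s_width in zip(ap_row, theta_row, width_row):
            total += assigned * s_theta * s_width
    return total


# Hypothetical normalized scores for m = 2 cameras (rows), n = 3 targets (columns).
N_THETA = [[0.9, 0.2, 0.0],
           [0.4, 0.8, 0.7]]
N_W = [[0.5, 1.0, 0.6],
       [1.0, 0.5, 0.5]]

# Two candidate assignments; each target is watched by exactly one camera.
AP_A = [[1, 0, 0],
        [0, 1, 1]]
AP_B = [[1, 1, 0],
        [0, 0, 1]]

if __name__ == "__main__":
    print(evaluate(AP_A, N_THETA, N_W))  # approximately 1.2
    print(evaluate(AP_B, N_THETA, N_W))  # approximately 1.0
```

In this example the first assignment scores higher because the second target is handed to the camera that sees it more frontally, even though its face appears smaller in that view.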
Moreover, as these n targets keep moving within the monitored scene, we need to
adaptively adjust the assignment of cameras to achieve the most preferable observation.
In addition, to simplify the problem, we add one extra constraint to Eq. 3-6. The constraint is stated in Eq. 3-8:
$$\sum_{i=1}^{m} ap_{ij} = 1, \quad j = 1, 2, \ldots, n \tag{Eq. 3-8}$$
This constraint implies that we only take into account the camera view that is assigned to the target even though that target may also appear in some other views.
Because there are two criteria, each target has two observation levels: the level of orientation (shoot angle) and the level of clearness (face width). They are the values of the two normalized functions, Nθ and NW, respectively. A zero value of a normalized function means that the orientation or the clearness is too poor to identify the target’s face. We multiply the two scores to form the final score because we consider them to be dependent: if one of the scores for a target is low, we cannot see that target clearly even if the other score is high, so the final score should remain low. In addition, when the orientation or the clearness falls below its threshold, according to Eq. 3-4 or Eq. 3-5, the final score of that target in Eq. 3-6 becomes zero.
We add the overall observation levels of all targets together to evaluate the performance of camera coordination. According to the mapping functions we define, the value of the evaluation function (Eq. 3-6) becomes larger as the coordination improves, that is, as the captured faces become more frontal and of higher resolution. Thus, our goal is to find the set of camera assignment parameters, AP, that maximizes the evaluation function. In other words, we seek the optimal AP.
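As a minimal illustration of this maximization, the sketch below searches exhaustively over all assignments in which each target is watched by exactly one camera. It is not the optimization procedure of the proposed system, only a brute-force baseline under the assumption that the per-pair scores Nθ(θij)·NW(Wij) are fixed and given; the function name `best_assignment` and the score values are hypothetical.

```python
from itertools import product


def best_assignment(scores):
    """Exhaustively search the assignment that maximizes the total score.

    scores[i][j] is the product N_theta * N_W for camera i and target j.
    Each target is assigned to exactly one camera; a camera may take
    several targets. Returns (ap, total) with ap as an m-by-n 0/1 matrix.
    """
    m, n = len(scores), len(scores[0])
    best_total, best_choice = -1.0, None
    # choice[j] is the index of the camera assigned to target j.
    for choice in product(range(m), repeat=n):
        total = sum(scores[choice[j]][j] for j in range(n))
        if total > best_total:
            best_total, best_choice = total, choice
    ap = [[1 if best_choice[j] == i else 0 for j in range(n)] for i in range(m)]
    return ap, best_total


# Hypothetical per-pair scores for 2 cameras (rows) and 3 targets (columns).
SCORES = [[0.45, 0.20, 0.00],
          [0.40, 0.40, 0.35]]

if __name__ == "__main__":
    ap, total = best_assignment(SCORES)
    print(ap, total)
```

The search visits m^n candidates, so it is practical only for small numbers of cameras and targets. When the scores are fixed as assumed here, the maximum could equally be found by picking the best camera for each target independently; the exhaustive form is shown because it still applies if the score of one target depends on which other targets share its camera.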