Audience Cameraman - Virtual Cameramen - Intelligent Lecture Recording System

Chapter 4 Virtual Cameramen

4.2 Audience Cameraman

The purpose of the AC subsystem is to simulate the camera-control behaviors of a professional cameraman to capture the audience shot. The proposed system must decide which region to shoot before the camera can be aimed at the correct angle. The system flowchart of the AC is shown in Figure 4.20. The AC system is divided into two parts.

Two PTZ cameras are mounted together as a set: one is the global-view camera and the other is the local-view camera. The global-view camera is analogous to a photographer's eyes; it monitors the whole audience and supports ROI detection. The local-view camera is analogous to a photographer's camera. Once the system has detected the ROI, the local-view camera receives camera-action control signals to shoot the ROI. The AC system performs face detection in local-view images so that the PTZ camera can shoot specific audience members according to the camera action rules. If an audience member’s face is detected, the AC controls the local-view camera to zoom and focus on that audience member; otherwise, the AC performs ROI detection again. The output shot provided by the local-view camera is transferred to the VD, which performs shot selection.

Figure 4.19. Pointing recognition. The PTZ camera moves and shoots the area in which the speaker is pointing.

A. ROI detection with the global-view camera

First, the AC obtains input video from the global-view camera and scans for audience motion features to locate ROI candidates, which are the regions in which audience or crowd events are most likely to appear. The system uses “Features from Accelerated Segment Test” (FAST) [75] corner detection to find the feature points, and then uses optical-flow estimation to estimate the motion vector of each feature point. Pixels are measured in terms of motion feature density to identify ROI candidates. An ROI detection result is shown in Figure 4.21.

Figure 4.20. Flowchart of audience cameraman.

A.1 FAST corner detection

FAST corner detection is used to reduce the computational time of optical-flow estimation significantly, because motion vectors then need to be estimated only at the detected feature points. As shown in Figure 4.22, let the candidate point 𝑝 be the center of an approximate circle whose circumference consists of 16 pixels. Each of the 16 pixels on the circumference is marked with the following state:

$$
\text{state}_i =
\begin{cases}
\text{dark}, & I_i \le I_p - t \\
\text{bright}, & I_i \ge I_p + t
\end{cases}
\tag{4.14}
$$

where state_𝑖 is the state of pixel 𝑖, 𝑖 = 1, … , 16; 𝐼_𝑖 is the intensity value of pixel 𝑖; 𝐼_𝑝 is the intensity value of point 𝑝; and 𝑡 is a threshold. If 𝑚 of the 16 pixels have the same state, the point 𝑝 is a FAST feature point (in the standard FAST segment test [75], these 𝑚 pixels must be contiguous on the circle); in this study, 𝑚 = 9.

Figure 4.21. ROI detection (a) motion feature map (b) motion feature density map (c) ROI candidate.


Figure 4.22. FAST corner detection [75].

Figure 4.23 shows a FAST corner detection result.
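The segment test of Equation (4.14) can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this study: the function name `is_fast_corner`, the default threshold `t=20`, and the synthetic test image are assumptions, and the sketch requires the 𝑚 same-state pixels to be contiguous on the circle, as in the standard FAST test.

```python
# Offsets (dx, dy) of the 16-pixel circle of radius 3, listed in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, t=20, m=9):
    """Return True if pixel (x, y) passes the segment test of Equation (4.14).

    img is a list of rows of grey values; (x, y) must lie at least
    3 pixels away from every image border.
    """
    ip = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    dark = [v <= ip - t for v in ring]
    bright = [v >= ip + t for v in ring]
    for flags in (dark, bright):
        run = best = 0
        # The ring is doubled so that runs wrapping past the last pixel count.
        for flag in flags + flags:
            run = run + 1 if flag else 0
            best = max(best, run)
        if min(best, len(ring)) >= m:
            return True
    return False


# Hypothetical test image: a bright square occupying the lower-right quadrant.
img = [[200 if (r >= 10 and c >= 10) else 0 for c in range(20)] for r in range(20)]
corner = is_fast_corner(img, 10, 10)    # corner of the square
edge = is_fast_corner(img, 10, 15)      # point on a straight edge
flat = is_fast_corner(img, 15, 15)      # point inside the uniform region
```

On this image only the corner of the square passes the test: a straight edge yields at most 7 contiguous dark pixels, which is below 𝑚 = 9.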

A.2 Optical-flow estimation

After FAST feature detection, the optical-flow method calculates the motion vector between successive images. The motion vector extraction applies Lucas-Kanade optical-flow [76] estimation. The proposed optical-flow method is based on a pyramid architecture involving multiple layered images with different resolutions. The system performs coarse tracking on a low-resolution image; it performs fine tracking on a high-resolution image.

The fundamental theory of the optical-flow method assumes the intensity invariance of some pixel between successive images of a moving object taken within a very short time. The velocity vector is the motion direction of the pixel. 𝐸(𝑥, 𝑦, 𝑡) is the intensity value of pixel (𝑥, 𝑦) at time 𝑡. After ∆𝑡 time, ∆𝑥 and ∆𝑦 express the displacement; the equation is as follows:

$$E(x, y, t) = E(x + \Delta x,\; y + \Delta y,\; t + \Delta t) \tag{4.15}$$

The first-order Taylor expansion of the right-hand side of Equation (4.15) is as follows:

$$E(x + \Delta x,\; y + \Delta y,\; t + \Delta t) = E(x, y, t) + E_x \Delta x + E_y \Delta y + E_t \Delta t \tag{4.16}$$

where 𝐸_𝑥, 𝐸_𝑦, and 𝐸_𝑡 are the partial derivatives of 𝐸 with respect to 𝑥, 𝑦, and 𝑡. Then, by substituting (4.16) into (4.15), dividing by ∆𝑡, and simplifying, we obtain:

$$E_x \frac{\Delta x}{\Delta t} + E_y \frac{\Delta y}{\Delta t} + E_t = 0 \tag{4.17}$$

When ∆𝑡 is sufficiently small (∆𝑡 → 0), we can rewrite (4.17) as:

$$E_x \frac{dx}{dt} + E_y \frac{dy}{dt} + E_t = 0 \tag{4.18}$$

Let 𝑢 = 𝑑𝑥/𝑑𝑡 and 𝑣 = 𝑑𝑦/𝑑𝑡 be the horizontal and vertical components of the motion velocity, respectively. Therefore, (𝑢, 𝑣) is the proposed optical-flow estimation result.

Figure 4.23. FAST corner detection result.
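The constraint in Equation (4.18) can be checked numerically on a synthetic pattern. The following is a small sketch under stated assumptions: the intensity function `E`, the velocity `(U, V)`, and the use of central finite differences are all hypothetical choices made for illustration and are not taken from the system described here.

```python
import math

U, V = 0.5, -0.3  # assumed ground-truth velocity, in pixels per frame


def E(x, y, t):
    """Smooth synthetic intensity pattern that translates with velocity (U, V)."""
    return math.sin(0.1 * (x - U * t)) + math.cos(0.08 * (y - V * t))


def constraint_residual(x, y, t):
    """Evaluate E_x*u + E_y*v + E_t with central finite differences."""
    e_x = (E(x + 1, y, t) - E(x - 1, y, t)) / 2.0
    e_y = (E(x, y + 1, t) - E(x, y - 1, t)) / 2.0
    e_t = (E(x, y, t + 1) - E(x, y, t - 1)) / 2.0
    return e_x * U + e_y * V + e_t


# Largest violation of the constraint over a 64x64 grid at t = 0.
worst = max(abs(constraint_residual(x, y, 0))
            for x in range(64) for y in range(64))
```

Because the pattern only translates, the residual stays close to zero everywhere; what remains is the discretization error of the finite differences, which shrinks as the pattern becomes smoother or the motion smaller.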

The optical-flow method can track FAST feature points as follows. Considering an image sequence I = {𝐼₁, 𝐼₂, … , 𝐼_𝑇}, let [𝑃_𝑥, 𝑃_𝑦]^𝑡 be the position of a feature point 𝑃, where 𝑡 ∈ {1, … , 𝑇}. 𝐼_𝑡(𝑃) is the intensity value of 𝑃 in the image. 𝜏 is the time interval. The motion between successive frames can be represented as follows:

$$I_t(x, y) = I_{t+\tau}(x + \Delta x,\; y + \Delta y) \tag{4.19}$$

where 𝑣⃑ = (𝑣_𝑥, 𝑣_𝑦) is the velocity of 𝑃 during the time interval 𝜏.

The purpose of the optical-flow method is to find the optimal 𝑣⃑ that minimizes the matching error 𝜀. Let 𝑤_𝑥 and 𝑤_𝑦 be the half-width and half-height of the search window (see Figure 4.24). The definition of 𝜀 is as follows:

$$
\varepsilon(\vec{v}) = \varepsilon(v_x, v_y)
= \sum_{x = p_x - w_x}^{p_x + w_x} \; \sum_{y = p_y - w_y}^{p_y + w_y}
\bigl( I_{t+\tau}(x + \Delta x,\; y + \Delta y) - I_t(x + v_x,\; y + v_y) \bigr)^2
\tag{4.20}
$$

To obtain the optimized values 𝑣_𝑥opt and 𝑣_𝑦opt, we process the partial derivatives of 𝜀 with respect to 𝑣_𝑥 and 𝑣_𝑦 [...]

Figure 4.24. Search window of proposed optical-flow method.

The Taylor expansions of Equations (4.21) and (4.22) are as follows: [...]

Because images are digital signals, the pixel values are not continuous; they are discrete integers. We can therefore compute the partial derivatives by finite differences. Let the image width be W and the height be H; then one may calculate:

𝐼_𝑥 = ∂𝐼_𝑡/∂𝑥 = { [...]

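Now substitute (4.26) and (4.27) into (4.23) and (4.24), respectively:

$$
\frac{\partial \varepsilon(\vec{v})}{\partial v_x}
= -2 \sum_{x = p_x - w_x}^{p_x + w_x} \; \sum_{y = p_y - w_y}^{p_y + w_y}
\left[ I_t I_x + v_x I_x^2 + v_y I_x I_y \right]
\tag{4.28}
$$

$$
\frac{\partial \varepsilon(\vec{v})}{\partial v_y}
= -2 \sum_{x = p_x - w_x}^{p_x + w_x} \; \sum_{y = p_y - w_y}^{p_y + w_y}
\left[ I_t I_y + v_y I_y^2 + v_x I_x I_y \right]
\tag{4.29}
$$

Setting both derivatives to zero and transforming them to matrix form gives:

$$
\begin{bmatrix}
\sum_{x,y} I_x^2 & \sum_{x,y} I_x I_y \\
\sum_{x,y} I_x I_y & \sum_{x,y} I_y^2
\end{bmatrix}
\begin{bmatrix} v_x \\ v_y \end{bmatrix}
= -
\begin{bmatrix} \sum_{x,y} I_x I_t \\ \sum_{x,y} I_y I_t \end{bmatrix}
\tag{4.30}
$$

where \sum_{x,y} abbreviates the double sum over the search window used in (4.28) and (4.29). Rewrite (4.30) as 𝐺𝑣⃑ = −𝑏, where 𝐺 is the 2×2 matrix on the left-hand side and 𝑏 is the vector of summed products 𝐼_𝑥𝐼_𝑡 and 𝐼_𝑦𝐼_𝑡; then 𝑣⃑ = −𝐺⁻¹𝑏 is the proposed vector.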
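As an illustration of how the system 𝐺𝑣⃑ = −𝑏 of Equation (4.30) can be solved for one feature point, consider the following sketch. It is written under several assumptions that are not taken from the text: 𝐼_𝑥 and 𝐼_𝑦 are approximated by central differences, 𝐼_𝑡 by the difference between the two frames, the half-window sizes default to 5 (approximating the 10×10 window), and the names `lucas_kanade_window` and `make_frame` are hypothetical.

```python
def lucas_kanade_window(prev, curr, px, py, wx=5, wy=5):
    """Solve G v = -b over the search window centred on (px, py).

    prev and curr are lists of rows of grey values. Returns (vx, vy),
    or None when G is singular (no usable gradient in the window).
    """
    gxx = gxy = gyy = bx = by = 0.0
    for y in range(py - wy, py + wy + 1):
        for x in range(px - wx, px + wx + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # spatial derivative in x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # spatial derivative in y
            it = curr[y][x] - prev[y][x]                  # temporal derivative
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
            bx += ix * it
            by += iy * it
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        return None
    # v = -G^{-1} b, with the 2x2 inverse written out explicitly.
    vx = -(gyy * bx - gxy * by) / det
    vy = -(gxx * by - gxy * bx) / det
    return vx, vy


def make_frame(dx, dy, size=33):
    """Synthetic quadratic intensity bowl, shifted by (dx, dy) pixels."""
    c = size // 2
    def intensity(x, y):
        u, v = x - c - dx, y - c - dy
        return u * u + v * v + u * v
    return [[intensity(x, y) for x in range(size)] for y in range(size)]


prev_frame = make_frame(0.0, 0.0)
curr_frame = make_frame(0.3, 0.2)   # the pattern moved by (0.3, 0.2) pixels
flow = lucas_kanade_window(prev_frame, curr_frame, 16, 16)
```

The quadratic test pattern is chosen because central differences are exact for it and the window is centred on its minimum, so the recovered vector matches the imposed shift up to rounding. On real images the estimate is only approximate and degrades as the displacement grows.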

If the motion is fast, we can increase the search window size, but the computing time will increase too. In the present research, the search window size was 10×10. If the displacement of the feature point is excessively large, it can be challenging to track the feature point.

The Lucas-Kanade optical-flow method provides a solution to this problem by building an image pyramid. First, we define 𝐼^𝐿 to be the 𝐿th layer of the image pyramid, where 𝐿 ∈ {0, 1, 2, … , 𝑚} and 𝐼⁰ is the original image. By using bilinear interpolation, we can measure the intensity of pixels on the next layer. The following is the bilinear interpolation function:

$$
\begin{aligned}
I^L(x, y) ={} & \tfrac{1}{4}\, I^{L-1}(2x, 2y) \\
& + \tfrac{1}{8} \bigl[ I^{L-1}(2x - 1, 2y) + I^{L-1}(2x + 1, 2y) + I^{L-1}(2x, 2y - 1) + I^{L-1}(2x, 2y + 1) \bigr] \\
& + \tfrac{1}{16} \bigl[ I^{L-1}(2x - 1, 2y - 1) + I^{L-1}(2x + 1, 2y - 1) + I^{L-1}(2x - 1, 2y + 1) + I^{L-1}(2x + 1, 2y + 1) \bigr]
\end{aligned}
\tag{4.31}
$$

The weights sum to one, so the average intensity is preserved from layer to layer. Therefore, we can perform tracking on the image pyramid. First, the displacement at the 𝑚th layer is computed. Then the (𝑚 − 1)th layer is computed, and so on until layer 0 is reached. In other words, tracking proceeds from the low-resolution layer to the high-resolution layer: we find rough feature locations at low resolution and update the location layer by layer. This technique improves the stability of the optical-flow estimation. The image pyramid in the present research had a four-layer architecture; therefore, 𝑚 = 3.
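A four-layer pyramid of this kind can be sketched as follows. The sketch uses the standard smoothing kernel for pyramidal Lucas-Kanade tracking, with weight 1/4 on the centre pixel, 1/8 on each of the four edge neighbours, and 1/16 on each of the four diagonal neighbours. The border rule (clamping coordinates to the image) and the names `pyramid_down` and `build_pyramid` are assumptions made for illustration.

```python
def pyramid_down(img):
    """Build layer L from layer L-1 by smoothing and 2x downsampling."""
    h, w = len(img), len(img[0])

    def px(x, y):
        # Assumed border rule: clamp coordinates to the valid image range.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = []
    for y in range((h + 1) // 2):
        row = []
        for x in range((w + 1) // 2):
            cx, cy = 2 * x, 2 * y
            centre = px(cx, cy)
            edges = (px(cx - 1, cy) + px(cx + 1, cy)
                     + px(cx, cy - 1) + px(cx, cy + 1))
            corners = (px(cx - 1, cy - 1) + px(cx + 1, cy - 1)
                       + px(cx - 1, cy + 1) + px(cx + 1, cy + 1))
            row.append(centre / 4.0 + edges / 8.0 + corners / 16.0)
        out.append(row)
    return out


def build_pyramid(img, m=3):
    """Return [I^0, I^1, ..., I^m]; m = 3 gives a four-layer pyramid."""
    layers = [img]
    for _ in range(m):
        layers.append(pyramid_down(layers[-1]))
    return layers


# Hypothetical 64x48 image of constant intensity.
base = [[7.0] * 64 for _ in range(48)]
pyramid = build_pyramid(base, m=3)
sizes = [(len(layer[0]), len(layer)) for layer in pyramid]
```

During tracking, the displacement estimated at layer 𝐿 is typically doubled and used as the initial guess at layer 𝐿 − 1, so a 10×10 window at the coarsest of four layers covers a motion roughly eight times larger in the original image.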

A.3 ROI selection

Next, the motion information from optical-flow estimation must be converted into motion density maps. Each motion density map is assembled from the statistics of optical-flow estimation through a motion point and its moving distance. As high and low densities give different shades and intensities, a low-density area means smaller motion, whereas a high-density region means larger motion. A region with relatively high density and with relatively numerous and concentrated motion vectors can be selected as a candidate ROI.
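One simple way to turn motion vectors into a density map is to accumulate the moving distance of each feature point into a coarse grid and take the densest cell as the ROI candidate. The sketch below follows that idea only; the grid cell size, the `min_move` noise threshold, the weighting by moving distance, and both function names are assumptions rather than details of the system described here.

```python
import math


def motion_density_map(vectors, width, height, cell=40, min_move=1.0):
    """Accumulate motion magnitude per grid cell.

    vectors is a list of (x, y, vx, vy) tuples: a feature position inside
    the image and its optical-flow vector. Moves shorter than min_move
    are treated as noise and ignored.
    """
    cols = math.ceil(width / cell)
    rows = math.ceil(height / cell)
    density = [[0.0] * cols for _ in range(rows)]
    for x, y, vx, vy in vectors:
        dist = math.hypot(vx, vy)
        if dist < min_move:
            continue
        density[int(y) // cell][int(x) // cell] += dist
    return density


def roi_candidate(density, cell=40):
    """Return the densest cell as (left, top, width, height), or None."""
    value, r, c = max((v, r, c)
                      for r, row in enumerate(density)
                      for c, v in enumerate(row))
    if value <= 0.0:
        return None
    return (c * cell, r * cell, cell, cell)


# Hypothetical flow vectors in a 320x240 global-view frame.
vectors = [(90, 50, 3, 4), (100, 60, 0, 5), (110, 70, 6, 8),
           (300, 200, 0.2, 0.1), (10, 10, 1, 1)]
density = motion_density_map(vectors, 320, 240)
roi = roi_candidate(density)
```

Here three strongly moving points fall into the same cell, so that cell becomes the candidate, while the nearly static point is discarded and the isolated weak mover contributes too little density to compete.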

However, the aforementioned method may select multiple ROI candidates that share the highest motion density, and noise in the image may also be selected as an ROI candidate. To avoid these situations, this study applies the STA network model (described in Chapter 2) to record each captured ROI in an attention map. The ROI candidates are entered as inputs into the STA (see Figure 4.25), and from the information in the attention map the system determines whether a candidate ROI can be considered a feasible ROI. The STA model thus records and provides additional information that helps the system determine the ROI that is most suitable as a photographic subject. Further, the system computes the relative distance between the ROI center and the image center and outputs motion-control signals for the local-view camera based on the camera action rules.
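The last step, converting the offset between the ROI center and the image center into a steering command, might look like the following sketch. It assumes a pinhole camera with square pixels, a hypothetical horizontal field of view of 60 degrees, and a local-view camera aligned with the global-view camera; none of these values or the function name `pan_tilt_offset` come from the text, and the actual camera action rules are those of Section 4.4.

```python
import math


def pan_tilt_offset(roi_center, image_size, hfov_deg=60.0):
    """Return (pan, tilt) in degrees that would centre the ROI.

    Positive pan turns right and positive tilt turns up. hfov_deg is an
    assumed horizontal field of view, not a measured camera parameter.
    """
    width, height = image_size
    # Focal length in pixels implied by the assumed field of view.
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    dx = roi_center[0] - width / 2.0    # positive: ROI is right of centre
    dy = roi_center[1] - height / 2.0   # positive: ROI is below centre
    pan = math.degrees(math.atan2(dx, focal_px))
    tilt = -math.degrees(math.atan2(dy, focal_px))  # image y grows downward
    return pan, tilt


# ROI centred on the right edge of a 640x360 frame, at mid height.
pan, tilt = pan_tilt_offset((640, 180), (640, 360))
```

An ROI at the right edge of the frame maps to a pan of half the field of view, and an ROI already at the image center yields no movement.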

B. Face detection with the local-view camera

Figure 4.25. ROI selection (a) ROI detection result (b) the shot after camera steering (c) the attention map for selection.


Because the AC subsystem is focused on the audience, numerous faces might appear in the AC shot simultaneously; therefore all face information is vital. At the same time, faces are the most easily recognizable parts of the audience. The local-view camera captures video information from the ROI by considering aesthetics and by analyzing optical characteristics. The AC system simulates professional photography shooting skills through the aforementioned process.

B.1 Face detection method

To find the audience in the shot, we must determine face locations. The AC uses the same face detection method as the SC uses. The subject’s face is generally selected, because it is simple and uniform relative to the body, which is covered by clothes. The AdaBoost algorithm described in Section 4.1 is also applied in the AC subsystem. After the completion of face detection, the AC subsystem records the center coordinates, height, and width of each individual face. Each individual face is a principal candidate of a salient object. Finally, the recorded face coordinates nearest to the center of a recent ROI are selected as a target (see Figure 4.26).
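The final selection step, choosing the recorded face nearest to the ROI center, can be sketched as follows. The tuple layout `(cx, cy, width, height)` mirrors the recorded center coordinates, height, and width, but the layout itself, the function name `select_target_face`, and the sample values are assumptions for illustration.

```python
import math


def select_target_face(faces, roi_center):
    """Return the face whose centre is nearest to roi_center, or None.

    faces is a list of (cx, cy, width, height) tuples produced by a
    face detector.
    """
    if not faces:
        return None
    return min(faces,
               key=lambda f: math.hypot(f[0] - roi_center[0],
                                        f[1] - roi_center[1]))


# Hypothetical detections in a local-view frame.
faces = [(50, 60, 20, 20), (160, 120, 30, 30), (300, 200, 25, 25)]
target = select_target_face(faces, roi_center=(170, 110))
```

Returning `None` for an empty detection list corresponds to the case in which no face is found and the AC goes back to ROI detection.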

B.2 Camera action and shot composition

Faces are not the only salient objects for the cameraman; the center of the screen can also be a salient object. For example, when the salient object is the audience, to highlight specific audience members, the proportion of the selected audience in the picture must not be too small; one-third of the size of the image is best. The details of the shooting and camera action rules are described in Section 4.4.

Figure 4.26. Face detection process (a) ROI detected result (b) the face results after camera steering (c) the face of salient object.
