
Chapter 3   Tracking of Facial Feature Points

3.6  Experimental Results

Some experimental results of applying the proposed method to track feature points are shown in Figure 3.18. It can be seen that the proposed method not only tracks the facial feature points but also corrects their positions, so that correct results are obtained.


Figure 3.18 A sequence of feature point tracking results from a video clip of a person saying “everybody” in Chinese.


Chapter 4

Creation of Virtual Faces with Dynamic Mouth Movements

4.1 Idea of Proposed Technique

The main purpose of this study is to enable a person seen in a still input image to say the same words uttered by another person appearing in a video model, that is, to let the input image have the same mouth shapes as those in the video model. In this study, we use a morphing method to warp the input image to the frames in the video model to achieve this goal. That is, we divide the mouth shape into quadrilaterals in the input image and do the same in every frame of the video model, and then map every quadrilateral of the input image to the corresponding one in each frame of the video model.

In this chapter, the mouth shape division technique we propose is described in Section 4.1.1, and the main steps of the proposed virtual face creation process are described in Section 4.1.2.

4.1.1 Mouth Shape Division

We separate the mouth image into two parts: a mouth part and a skin part which is near the mouth. The mouth part is divided into fourteen overlapping quadrilaterals, and the skin part is divided into thirteen overlapping quadrilaterals, as shown in Figure 4.1.

The way of such divisions is to partition the mouth or skin part into quadrilaterals according to the mouth features, such as the upper lip, the bottom lip, and the teeth.

Quadrilaterals 1 through 4 shown in Figure 4.1 compose the upper lip, and quadrilaterals 9 through 14 compose the bottom lip. We divide the two lips in this way because we want the teeth part (quadrilaterals 5 through 8) to be independent, since the teeth part is the only part whose image information is obtained from the video model.


Figure 4.1 Proposed mouth shape division scheme which divides the mouth shape into twenty-seven overlapping quadrilaterals.

4.1.2 Main Steps of Proposed Virtual Face Creation Process

The main steps of the proposed virtual face creation process are mouth shape division, mouth shape morphing, and mouth region extraction, as shown in Figure 4.2.

First, we divide the mouth image into quadrilaterals by the previously-mentioned technique. Then, we use an image morphing technique to let the input image have the same mouth shapes as those in the video model, which is described in Section 4.3.

After the mouth shape morphing, we can get a mouth image sequence. We do not paste the entire mouth shape onto the input image. Instead, we propose an extraction scheme to extract the mouth region more precisely from every frame of the image sequence. To make the mouth region smoother, we fill the gaps and smooth the boundary of the mouth region. Finally, we paste each mouth region onto the input image by some rules described later to get the result.

Figure 4.2 The flowchart of the proposed virtual face creation process.


4.2 Creation of Real-Face Video Model

In this study, the real-face video model is recorded with a camera, and the person in it utters some speech. Each such model in the proposed system is used to enable a person in the input image to talk, so we have to extract talking information from the videos to convert them into useful video models. The talking information includes the mouth shapes, the mouth size changing information, the feature point positions, and the image information in the video models.

In this section, some criteria identified in this study for real-face video model creation are described in Section 4.2.1. The way to locate feature points in real-face images is presented in Section 4.2.2. Finally, the proposed real-face video model creation process is described in Section 4.2.3.

4.2.1 Criteria for Real-Face Video Model Creation

We now describe the criteria for real-face video model creation which we propose to make the extraction of the talking information smooth. In addition, some assumptions made for the real-face video model were mentioned in Section 1.3.2.

The frame rate of our video models is 30 frames per second; that is, each frame lasts about 0.033 seconds. However, a typical mouth shape change during a normal talking process takes less than 0.03 seconds, so fast speech cannot be captured faithfully. The most important restriction is therefore the talking speed: the speech must be spoken at a medium speed to make the tracking easier.

4.2.2 Locating Feature Points in Real-Face Images

A real-face video model is composed of a real-face image sequence. Because the feature point positions of the first frame of each video model are used to acquire the feature points of the other frames, we must locate the positions of the feature points of the first real-face image precisely.

The positions of the feature points are those of the black dots shown in Figure 4.1. The system automatically adjusts the x coordinates of these feature points to move them to symmetric positions, as shown in Figures 3.10(b) through 3.10(d).

Finally, we manually adjust the y coordinates of these feature points to make them fit the mouth edges.

4.2.3 Real-Face Video Model Creation Process

The proposed creation process includes warping the input image to each frame of the video model to get the virtual-mouth image, scaling the virtual-mouth image according to the real-face video model, and integrating the virtual-mouth image with the input image, as shown in Figure 4.3. Figures 4.3(c) and 4.3(d) have the same mouth shape as Figure 4.3(b).

During this process, the mouth image may be enlarged and then reduced, reduced and then enlarged, or left unchanged. Enlarging an image usually causes blurring, and we want to avoid this situation. In this study, the mouth size of the video model and that of the input image have one of the following three relations:

(1) The mouth size of the input image < the mouth size of the video model.

(2) The mouth size of the input image > the mouth size of the video model.

(3) The mouth size of the input image = the mouth size of the video model.

Relation (2) is exactly what we want to avoid; the input image in this relation will be warped to the smaller images of the video model and then scaled back to the larger image size. So we propose to solve this by adjusting every frame in the video model according to the ratio of the mouth width of the input image to that of the first frame. The mouths in the input image and in the frames should then have the same width, so that the input image does not need to be reduced during the warping process.
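This width normalization can be sketched as follows in Python with OpenCV. It is a minimal illustration under the assumption that the two mouth widths have already been measured from the located feature points; the function and parameter names are ours, not part of the proposed system.

import cv2

def normalize_model_frames(frames, input_mouth_width, first_frame_mouth_width):
    # Rescale every frame of the video model so that the mouth width in its
    # first frame matches the mouth width of the input image; then the input
    # image never has to be reduced and re-enlarged during warping.
    scale = input_mouth_width / first_frame_mouth_width
    return [cv2.resize(frame, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_LINEAR)
            for frame in frames]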


Figure 4.3 The mouth images. (a) The mouth image of the input image. (b) The mouth image of a frame of the video model. (c) The virtual-mouth image created from (a) by warping it to (b). (d) The virtual-mouth image scaled from (c) and integrated with (a).

4.3 Mouth Shape Morphing with Bilinear Transformation

We use the bilinear transformation and the inverse bilinear transformation proposed by Gomes, et al. [14] to transform between two quadrilaterals, and their details are described in Sections 4.3.1 and 4.3.2, respectively. The proposed mouth shape morphing process is described in Section 4.3.3.

4.3.1 Review of Bilinear Transformation

Gomes, et al. [14] proposed the technique of bilinear transformation to warp a unit square to an arbitrary quadrilateral, and this transformation is denoted as T(u, v).

They defined T1(u, v) and T2(u, v) according to the following equations:

T(u, v) = (T1(u, v), T2(u, v)), (u, v) ∈ [0, 1]²; (4.1)

T1(u, v) = auv + bu + cv + d; (4.2)

T2(u, v) = euv + fu + gv + h. (4.3)

As shown in Figure 4.4, we know the values of the transformations T1 and T2 on the vertices of the unit square to be

T1(0, 0) = a1, T1(1, 0) = b1, T1(1, 1) = c1, T1(0, 1) = d1;
T2(0, 0) = a2, T2(1, 0) = b2, T2(1, 1) = c2, T2(0, 1) = d2. (4.4)

Figure 4.4 The bilinear transformation proposed by Gomes, et al. [14].

Then, we compute the coefficients of Equations (4.2) and (4.3) by substituting Equation (4.4) into T1(u, v) and T2(u, v), and we obtain these coefficients in symbolic form as follows:

a = c1 − b1 + a1 − d1; (4.5)

b = b1 − a1; (4.6)

c = d1 − a1; (4.7)

d = a1; (4.8)

e = c2 − b2 + a2 − d2; (4.9)

f = b2 − a2; (4.10)

g = d2 − a2; (4.11)

h = a2. (4.12)

We know the coordinates of the four vertices of the quadrilateral ABCD, so we can compute these coefficients and determine the bilinear transformation by plugging these values into Equations (4.2) and (4.3). Then every pixel (u, v) in the unit square finds a corresponding pixel (x, y) in ABCD.

4.3.2 Review of Inverse Bilinear Transformation

Gomes, et al. [14] also proposed an inverse bilinear transformation which is conducted in a reverse direction from the quadrilateral to the unit square, as shown in Figure 4.5. A pixel R in quadrilateral ABCD is denoted as R(x, y).

Figure 4.5 The inverse bilinear transformation proposed by Gomes, et al. [14].

Solving for v and for u in Equation (4.2) and substituting these into Equation (4.3), we can obtain two equations as follows:

(au + c)(fu + h − y) − (eu + g)(bu + d − x) = 0;

(av + b)(gv + h − y) − (ev + f)(cv + d − x) = 0.

And they can be rewritten as two quadratic equations as follows:

Eu² + Fu + G = 0; Hv² + Iv + J = 0, (4.13)

where E = af − be, F = a(h − y) + cf − e(d − x) − bg, G = c(h − y) − g(d − x), H = ag − ce, I = a(h − y) + bg − e(d − x) − cf, and J = b(h − y) − f(d − x).

Then, we can get the solutions of u and v as follows:

u = (−F ± √(F² − 4EG)) / (2E); (4.14)

v = (−I ± √(I² − 4HJ)) / (2H), (4.15)

where for each equation the root lying in the interval [0, 1] is the valid one.
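A corresponding sketch of the inverse mapping in Python follows. It solves the two quadratics of Equation (4.13) with Equations (4.14) and (4.15) and keeps the root that falls inside the unit square; the handling of the degenerate linear case is our assumption.

import math

def inverse_bilinear_map(coeffs, x, y, eps=1e-9):
    a, b, c, d, e, f, g, h = coeffs
    E = a * f - b * e
    F = a * (h - y) + c * f - e * (d - x) - b * g
    G = c * (h - y) - g * (d - x)
    H = a * g - c * e
    I = a * (h - y) + b * g - e * (d - x) - c * f
    J = b * (h - y) - f * (d - x)

    def solve(p, q, r):
        # Solve p*t^2 + q*t + r = 0 and prefer the root inside [0, 1].
        if abs(p) < eps:                       # degenerate case: linear equation
            return -r / q
        disc = math.sqrt(max(q * q - 4.0 * p * r, 0.0))
        roots = ((-q + disc) / (2.0 * p), (-q - disc) / (2.0 * p))
        for t in roots:
            if -eps <= t <= 1.0 + eps:
                return t
        return roots[0]

    return solve(E, F, G), solve(H, I, J)      # (u, v)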

4.3.3 Proposed Mouth Shape Morphing Process

Because the quadrilaterals of the mouth we deal with are not unit squares, we combine the two transformations of Gomes, et al. [14], mapping from one quadrilateral through the unit square to the other, as shown in Figure 4.6.

Figure 4.6 The transformations between two arbitrary quadrilaterals using the method of Gomes, et al. [14].


For mouth shape warping, we use a backward bilinear transformation from the quadrilateral A′B′C′D′ to the quadrilateral ABCD, as described in the following algorithm.

Algorithm 4.1. Bilinear transformation between two quadrilaterals.

Input: A quadrilateral ABCD with color information and the coordinates of the four vertices of the quadrilateral A′B′C′D′.

Output: A quadrilateral A′B′C′D′ with color information.

Steps:

1. Compute the coefficients a′, b′, c′, d′, e′, f′, g′, and h′ of A′B′C′D′ by Equations (4.5) through (4.12).

2. Compute the values of a, b, c, d, e, f, g, and h of ABCD by Equations (4.5) through (4.12).

3. For every pixel (x′, y′) in A′B′C′D′, perform the following steps to find the corresponding pixel (x, y) in ABCD.

3.1 Compute the values of E, F, G, H, I, and J from a′, b′, c′, d′, e′, f′, g′, h′, and the pixel (x′, y′) by Equation (4.13).

3.2 Compute the corresponding (u, v) with E, F, G, H, I, and J by Equations (4.14) and (4.15).

3.3 Compute the corresponding position (x, y) with a, b, c, d, e, f, g, h, u, and v by Equations (4.2) and (4.3).
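Algorithm 4.1 can then be sketched with the two functions defined above. The bounding-box scan, the inside-the-unit-square test, and the nearest-neighbour sampling are our simplifications, and the images are assumed to be arrays indexed as image[y, x].

def warp_quadrilateral(src_img, dst_img, quad_ABCD, quad_ApBpCpDp):
    coeffs = bilinear_coeffs(quad_ABCD)        # Steps 1-2: a..h of ABCD
    coeffs_p = bilinear_coeffs(quad_ApBpCpDp)  # and a'..h' of A'B'C'D'
    xs = [int(p[0]) for p in quad_ApBpCpDp]
    ys = [int(p[1]) for p in quad_ApBpCpDp]
    for yp in range(min(ys), max(ys) + 1):     # scan the bounding box of A'B'C'D'
        for xp in range(min(xs), max(xs) + 1):
            u, v = inverse_bilinear_map(coeffs_p, xp, yp)  # Steps 3.1-3.2
            if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
                continue                       # the pixel lies outside A'B'C'D'
            x, y = bilinear_map(coeffs, u, v)  # Step 3.3
            dst_img[yp, xp] = src_img[int(round(y)), int(round(x))]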


4.4 Creation of Virtual-Face Image Sequences

In this section, the proposed process for the generation of single virtual-mouth images is described in Section 4.4.1. The proposed technique for scaling mouth sizes by the real-face model is described in Section 4.4.2. The proposed process for the extraction of mouth regions from virtual faces is described in Section 4.4.3. The proposed processes for image gap filling and boundary smoothing are described in Section 4.4.4. Finally, the proposed process for the creation of virtual-face images is described in Section 4.4.5.

4.4.1 Generation of Virtual-Mouth Images

We morph every quadrilateral of the input image to the corresponding one of each frame of the video model with the previously-mentioned shape division, and an example is given in Figure 4.7. The coordinates of point 2.1 and of the additional points which help morphing, marked as blue dots in Figure 4.1, are listed as follows (a code sketch of these rules is given after the list):

(1) Point P84_21 = (point 8.4.x − 1, point 2.1.y + 2);

(2) Point P83_21 = (point 8.3.x + 1, point 2.1.y + 2);

(3) Point P108_21 = (point 10.8.x, point P84_21.y);

(4) Point P107_21 = (point 10.7.x, point P83_21.y);

(5) Point P108_84 = (point 10.8.x, point 8.4.y);

(6) Point P107_83 = (point 10.7.x, point 8.3.y);

(7) Point 2.1.y =point 2.1.y + 5.
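Written out in code, these rules look as follows. Here pts is assumed to map feature-point names such as '8.4' or '2.1' to (x, y) tuples, and the rules are applied in the listed order so that rule (7) moves point 2.1 only after the helper points depending on it have been computed.

def helper_points(pts):
    p = dict(pts)
    p['P84_21'] = (p['8.4'][0] - 1, p['2.1'][1] + 2)    # rule (1)
    p['P83_21'] = (p['8.3'][0] + 1, p['2.1'][1] + 2)    # rule (2)
    p['P108_21'] = (p['10.8'][0], p['P84_21'][1])       # rule (3)
    p['P107_21'] = (p['10.7'][0], p['P83_21'][1])       # rule (4)
    p['P108_84'] = (p['10.8'][0], p['8.4'][1])          # rule (5)
    p['P107_83'] = (p['10.7'][0], p['8.3'][1])          # rule (6)
    p['2.1'] = (p['2.1'][0], p['2.1'][1] + 5)           # rule (7)
    return p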



Figure 4.7 Generation of a virtual-mouth image. (a) A photo of Angelina Jolie as the single input image. (b) The real-face image which is the 50th frame of the video model. (c) The virtual-mouth image.

4.4.2 Scaling of Mouth Sizes by Real-Face Model

After generating the virtual-mouth images, we scale them to fit the input image.

The input image has a closed mouth. If we scale the mouth only according to the mouth size of the input image, the resulting image will be unnatural, as shown in Figure 4.8(e). We propose a technique to scale the mouth size according to the real-face model, as shown in Figures 4.8(a) and 4.8(b), so that the scaled mouth looks more natural, as shown in Figure 4.8(f).


Figure 4.8 The illustration of scaling mouth sizes. (a) The first frame of the video model. (b) The 85th frame of the video model. (c) A single input image. (d) The virtual-mouth image. (e) The virtual-mouth image scaled by (c). (f) The virtual-mouth image with a scaled mouth.


Also, the chin must have the same vertical movement as the mouth, which has three feature points P84_21, P83_21, and 2.1. So, we scale the mouth size and adjust the position of the chin according to the size changing information of the mouth in the current frame with respect to that in the first frame. We compute the scaling rates Rx and Ry of the current frame, and move the points, including points A, B, C, D, E, F, P84_21, P83_21, and 2.1 as shown in Figure 4.9, to the scaled positions decided by the scaling rates. The coordinates of A, B, C, D, E, and F are assigned according to those of the mouth control points mentioned in Section 2.2.3.

In the scaling process, the mouth division shown in Figure 4.9 is different from that discussed in Section 4.1.1, and the mouth lies inside the quadrilaterals DEBA and EFCB. Here, we reassign the coordinates of some points in the following way:

(1) Point A = (point 8.4.x − 3, point 8.9.y);

(2) Point B = (point 8.1.x, point 8.9.y + 1);

(3) Point C = (point 8.3.x + 3, point 8.9.y);

(4) Point D = (point 8.4.x, point 8.2.y);

(5) Point E = (point 8.2.x, point 8.2.y + 3);

(6) Point F = (point 8.3.x, point 8.2.y);

(7) Point P108_21.y = point P88_82.y;

(8) Point P107_21.y = point P82_87.y;

(9) Point P84_21.x = point 8.8.x − 2;

(10) Point P83_21.x = point 8.7.x + 2. (4.16)


Figure 4.9 Proposed mouth shape division scheme used to scale the mouth size, which divides the mouth shape into 12 overlapping quadrilaterals, including quadrilaterals DEBA and EFCB.

A mouth has two kinds of movement: the vertical movement My and the horizontal movement Mx. Mx and My are computed from the scaling rates and the size of the mouth (described later). The vertical movement of the bottom lip is usually larger than that of the upper lip, so we define the vertical movement of the upper lip to be one third of the value of My. The horizontal movements of the left mouth part and the right mouth part are both defined to be half of Mx.

If the mouth width MW0v of the current frame is smaller than that in the first frame of the video model, we reduce the width, the height, and the number of the quadrilaterals which will be transformed, as shown in Figure 4.10(b). Then, the resulting virtual mouth will contain only the mouth and the skin near the mouth, and will have no contour of the virtual face, as shown in Figure 4.10(a). We adjust the positions of the vertices of these quadrilaterals in the following way, as also sketched in code after the list:

(1) Point 10.8 = (point 8.4.x − (8.4.x − 10.8.x) / 3, point P108_84.y − 5);

(2) Point P108_84.x = point 10.8.x;

(3) Point 10.7 = (point 8.3.x + (10.7.x − 8.3.x) / 3, point P107_83.y − 5);

(4) Point P107_83.x = point 10.7.x;

(5) Point P84_21.y = point 8.2.y + (2.1.y − 8.2.y) / 4;

(6) Point P83_21.y = point P84_21.y;

(7) Point 2.1.y = point P84_21.y + 2. (4.17)
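These adjustments can be sketched in the same style as the earlier point rules. Again p maps point names to (x, y) tuples, and the rules are applied in order so that later rules use the already-updated coordinates.

def shrink_quadrilaterals(p):
    p['10.8'] = (p['8.4'][0] - (p['8.4'][0] - p['10.8'][0]) / 3,
                 p['P108_84'][1] - 5)                              # rule (1)
    p['P108_84'] = (p['10.8'][0], p['P108_84'][1])                 # rule (2)
    p['10.7'] = (p['8.3'][0] + (p['10.7'][0] - p['8.3'][0]) / 3,
                 p['P107_83'][1] - 5)                              # rule (3)
    p['P107_83'] = (p['10.7'][0], p['P107_83'][1])                 # rule (4)
    p['P84_21'] = (p['P84_21'][0],
                   p['8.2'][1] + (p['2.1'][1] - p['8.2'][1]) / 4)  # rule (5)
    p['P83_21'] = (p['P83_21'][0], p['P84_21'][1])                 # rule (6)
    p['2.1'] = (p['2.1'][0], p['P84_21'][1] + 2)                   # rule (7)
    return p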

(a)

(b)

Figure 4.10 Illustration of the scaled mouth shape when the mouth width of the current frame is smaller than that in the first frame of the video model. (a) The virtual-mouth image containing the mouth and the skin near it. (b) Proposed mouth shape division scheme used to scale the mouth size.

We generate the virtual-mouth images with scaled mouths and dynamically moving chins by the following algorithm.


Algorithm 4.2. Generation of virtual-mouth images with scaled mouths and dynamically moving chins.

Input: A virtual-mouth image sequence Vmouth, and an input image I.

Output: A virtual-mouth image sequence Vmouth′ with scaled mouths and dynamically moving chins.

Steps:

1. For each virtual-mouth image V of Vmouth except the first frame V0, perform the following steps.

1.1 Assign the coordinates of points A, B, C, D, E, F, P108_21, P107_21, P84_21, and P83_21 in V and I by Equation (4.16) for the mouth division.

1.2 Compute the scaling rates Rx and Ry by the following equations:

Rx = MW0v / dMW0; Ry = MH0v / dMH0,

where MW0v and MH0v are the mouth width and height in the current frame V, and dMW0 and dMH0 are those in the first frame V0 (Steps 1.2 through 1.4 are also sketched in code after this algorithm).

1.3 Compute the horizontal movement Mx and vertical movement My of the mouth from the scaling rates and the mouth size of the first frame as follows:

Mx = dMW0 × (Rx − 1); My = dMH0 × (Ry − 1).

1.4 Move the points assigned in Step 1.1 to the scaled positions as follows.

1.4.1 Move upward the points in the upper lip, including points A, B, and C, by subtracting My /3 from the y-coordinates of these points.

1.4.2 Move downward the points in the bottom lip and the chin, including points D, E, F, 2.1, P84_21, and P83_21, by adding My /3 to the y-coordinates of these points.

1.4.3 Move the point A in the left mouth left by subtracting Mx /2 from the x-coordinate of A.

1.4.4 Move the point C in the right mouth right by adding Mx /2 to the x-coordinate of C.

1.5 Warp the quadrilaterals DEBA and EFCB in V to those in I to get a warped virtual-mouth image V′.

1.6 If MW0v ≥ dMW0 × 0.85,

warp the quadrilaterals 1 through 10 in V to those in I to get the final scaled virtual-mouth image V′′;

else, perform the following steps.

1.6.1 Change the sizes of quadrilaterals 1, 2, 3, 4, 5, 6, 9, and 10 by Equations (4.17) to let the composition of these quadrilaterals be a mouth.

1.6.2 Warp these quadrilaterals in V to I, and blend quadrilaterals 1, 2, 3, 4, 9, and 10 with α by alpha blending to get the final scaled virtual-mouth image V′′.

2. Compose the virtual-mouth sequences Vmouth′ by all V′′.
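Steps 1.2 through 1.4 of Algorithm 4.2 can be sketched as follows. The Rx, Ry, Mx, and My formulas follow the reconstruction given in the algorithm, and pts is the point dictionary of the earlier sketches.

def move_division_points(pts, MW0v, MH0v, dMW0, dMH0):
    # MW0v/MH0v: mouth width and height in the current frame;
    # dMW0/dMH0: mouth width and height in the first frame of the model.
    Rx, Ry = MW0v / dMW0, MH0v / dMH0                 # Step 1.2: scaling rates
    Mx, My = dMW0 * (Rx - 1.0), dMH0 * (Ry - 1.0)     # Step 1.3: movements
    for name in ('A', 'B', 'C'):                      # Step 1.4.1: upper lip up
        x, y = pts[name]
        pts[name] = (x, y - My / 3.0)
    for name in ('D', 'E', 'F', '2.1', 'P84_21', 'P83_21'):
        x, y = pts[name]                              # Step 1.4.2: bottom lip and chin down
        pts[name] = (x, y + My / 3.0)
    x, y = pts['A']
    pts['A'] = (x - Mx / 2.0, y)                      # Step 1.4.3: left corner left
    x, y = pts['C']
    pts['C'] = (x + Mx / 2.0, y)                      # Step 1.4.4: right corner right
    return pts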

4.4.3 Extraction of Mouth Regions from Scaled-Mouth Images

The virtual-mouth image with a scaled mouth is called a scaled-mouth image in this section. We perform the extraction of mouth regions when MW0v ≥ dMW0 × 0.85.

In this case, the scaled-mouth images contain the contours of the faces. We extract the mouth region Rmouth by using a binary image B generated by edge detection, as mentioned previously in Section 3.5.2.

We propose a technique to extract the mouth region by observing the color changing information of B, as shown in Figure 4.11(b). Because we only care about the edge of the face, we remove the edge information of the mouth in B by marking it with the black color. The background is black, and the edge is marked with the white color in B. After the extraction, we get an image B′ and a mouth region Rmouth. In B′, the blue part is the edge of the face, and the black part is the facial skin, as shown in Figure 4.11. The blue and the black parts are what we want to keep, and together they compose the mouth region.


Figure 4.11 The facial images. (a) The scaled-mouth image created from the 85th frame of the video model. (b) The image B. (c) The image B′. (d) The mouth region of (a).


We scan the pixels in B in a vertical direction and observe the color changes. The order of the colors must be black, white, and black, which means that we scan pixels from the face portion to the edge, and then to the neck below the edge. The region Rmouth contains the edge and the face above the edge. The face sometimes has noise appearing as white dots in B′ according to our experimental experience, so we require that the edge have Hedge continuous white pixels in B, where Hedge is a user-defined constant. We extract the mouth region within a region R1 in B defined by the points bUpLf and bDnRt, and we define the range of the mouth by a rectangle R2 in B defined by the points sUpLf and sDnRt, as shown in Figure 4.12. The details of the extraction are described in the following algorithm.

Figure 4.12 Illustration of the range of the mouth and the mouth region.

Algorithm 4.3. Extraction of mouth regions from scaled-mouth images.

Input: A scaled-mouth image V of Vmouth′ generated by Algorithm 4.2, a binary image B generated by Algorithm 3.3, and a user-defined constant Hedge for detecting the edge.

Output: A mouth region Rmouth and an image B′.


Steps:

1. Let each pixel P of B have a corresponding pixel P′ in B′ whose coordinates are the same as those of P. For each P′, set the initial color C′ to be the color C of P.

2. Remove the edge values of the mouth by changing the pixel colors in the rectangle R2 in B, as shown in Figure 4.12, to the black color.

3. For each vertical line L in R1 in B, as shown in Figure 4.12, perform Steps 3.1 through 3.3.

3.1 Scan the pixels in L one by one, and set the initial value of the color Cpre of the previously-scanned pixel to be white, meaning that C′ of the previously-scanned pixel is not in Rmouth.

3.2 Set the initial value of the height of the edge H which will be detected to be 0.

3.3 For each pixel P in L, perform the following steps to get an image B′ which has three colors: black, white, and blue (this scan is also sketched in code after the algorithm).

3.3.1 If Cpre = black and C = black, then set C′ = black;

3.3.2 If Cpre = black and C = white, then set Cpre = white, C′ = blue, and H = 1;

3.3.3 If Cpre = white and C = black and H > Hedge, then set Cpre = blue and C′ = C;

3.3.4 If Cpre = white and C = black and H ≤ Hedge, then set Cpre = black, C′ = C, and H = 0;

3.3.5 If Cpre = white and C = white, then set H = H + 1 and C′ = blue;

3.3.6 If Cpre = blue, then set C′ = white.

4. Keep the pixels in V if the colors of them in B′ are not white, and compose the mouth region Rmouth by the kept pixels.
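The color-state scan of Step 3 can be sketched per vertical line as follows. Here column is one line of B listed from top (face) to bottom (neck), and encoding the three output colors as the constants below is our choice.

BLACK, WHITE, BLUE = 0, 1, 2

def classify_column(column, H_edge):
    out = []
    c_pre = WHITE    # Step 3.1: initial color of the previously-scanned pixel
    H = 0            # Step 3.2: height of the current white run
    for c in column:
        if c_pre == BLACK and c == BLACK:       # Step 3.3.1: still on the face
            out.append(BLACK)
        elif c_pre == BLACK and c == WHITE:     # Step 3.3.2: a white run starts
            c_pre, H = WHITE, 1
            out.append(BLUE)
        elif c_pre == WHITE and c == BLACK and H > H_edge:
            c_pre = BLUE                        # Step 3.3.3: the run was the real edge
            out.append(c)
        elif c_pre == WHITE and c == BLACK:     # Step 3.3.4: the run was only noise
            c_pre, H = BLACK, 0
            out.append(c)
        elif c_pre == WHITE and c == WHITE:     # Step 3.3.5: the run continues
            H += 1
            out.append(BLUE)
        else:                                   # Step 3.3.6: everything below the edge
            out.append(WHITE)
    return out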


4.4.4 Gap Filling and Boundary Smoothing

After the extraction of the mouth region, we find a problem: the edge of the mouth region is not smooth, as shown in Figure 4.11(c). In the mouth region, some
