

2.4  Virtual-Face Creation Process from Sequential Images

Figure 2.7 illustrates a flowchart of the stages of the proposed virtual-face creation process from sequential images. First, a neutral facial image and the first frame of a real-face video model are used as inputs to a feature point locator. After the locator finishes locating the feature points, the remaining frames of the video model and the feature points of the first frame are used as inputs to a feature point tracker.

Then, the feature point tracker extracts the feature points of the remaining frames of the video model. The problems mentioned in Section 2.1 are found to occur most often when a closed mouth is opening or when an open mouth is closing.

So we propose to detect the mouth states, including the two above-mentioned states, the opening state and the closing state, as well as an unchanged state, meaning that the mouth size in the current frame is the same as that in the previous frame. Then, we use the mouth state information to change the matching area dynamically in order to reduce incorrect matching results. This area-changing technique is called window size editing in the following.

When an open mouth gradually shrinks into a closed one, the positions of the feature points of the inner upper mouth part sometimes become different from those of the inner bottom mouth part. So we propose a technique to detect closed-mouth shapes and move the feature points of the inner mouth part to the correct positions we want.

We also propose a technique to track the feature points in a frame according to the image information in the previous frame. If the feature points in the previous frame are located at wrong positions, the tracker will track wrong points in all the remaining frames of the video model. Feature point correction is therefore necessary to make sure that the positions of the feature points are all correct; otherwise, feature point tracking will fail, according to our experimental experience.

The virtual face creator we propose will then divide and morph the mouth shapes to get the bottom part of every virtual face. The final step is to extract the mouth region from the virtual face and integrate it with the input image. This completes the proposed process for virtual face creation.

[Figure 2.7 flowchart: camera-captured image sequences, together with single facial images and real-face video models, are fed to the feature point locator; the feature point tracker then performs mouth state detecting, window size editing, image matching, closed-mouth shape detecting, and facial feature point correcting; finally, the virtual face creator performs mouth shape division, mouth shape morphing, and extraction of the mouth region.]

Figure 2.7 Stages of the proposed virtual face creation process from sequential images.


Chapter 3

Tracking of Facial Feature Points

3.1 Idea of Proposed Techniques

As mentioned in Section 1.2.2, during the tracking of facial feature points, suppose that a subimage w located at coordinates (s, t) within an image f is being processed. The moving range of w inside f is taken to be [2s + 1, 2t + 1] in this study. The region covered by w is called the content window, and the moving range is called the search window. We propose in this study an image matching technique that uses the mouth movement information to change the sizes of the content window and the search window.

Applying this technique, we can solve the interference problem caused by changing mouth shapes and teeth appearances mentioned in Section 2.1.
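To make the window notation concrete, the following sketch crops a content window and a search window around a feature point from a frame stored as a NumPy array; the function name, the argument layout, and the example radii are illustrative assumptions rather than part of the original system.

```python
import numpy as np

def crop_window(frame, cx, cy, left, right, top, bottom):
    """Crop a rectangular window around the point (cx, cy) of a grayscale frame.

    left, right, top, and bottom are the distances from the window origin to
    its four borders (the roles played by Sstart, Send, Tstart, and Tend).
    """
    h, w = frame.shape
    x0, x1 = max(cx - left, 0), min(cx + right + 1, w)
    y0, y1 = max(cy - top, 0), min(cy + bottom + 1, h)
    return frame[y0:y1, x0:x1]

# Example: a 35x35 content window and a 41x41 search window around a point P.
# content = crop_window(frame, px, py, 17, 17, 17, 17)
# search  = crop_window(frame, px, py, 20, 20, 20, 20)
```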

In this chapter, the necessity of changing the content and search window sizes and of correcting the facial feature point positions is explained in Sections 3.1.1 and 3.1.2, respectively. Finally, the proposed method for tracking facial feature points is described in Section 3.1.3.

3.1.1 Necessity of Changes of Window Sizes

Because the mouth shapes are not all the same during a person's talking process, the content window sometimes includes insufficient or too much information for image matching. Two other reasons for using different window sizes for each feature point are that the teeth interfere with the matching process in the tracking of some feature points and that the movement ranges of the feature points differ. So a window size adaptation technique is proposed.

Examples of tracking with unchanged and dynamically changed window sizes are shown in Figure 3.1: Figures 3.1(a) and 3.1(b) are results of applying a constant window size, and Figures 3.1(c) and 3.1(d) are those of applying dynamically changed window sizes. With the former scheme the points are tracked erroneously and stay at the same position, as shown in Figure 3.1(b), while with the latter scheme the points are tracked correctly onto the edge of the mouth, as shown in Figure 3.1(d).

Figure 3.1 Examples of the unchanged and dynamically changed window sizes. (a) The 69th frame of a video using a constant window size. (b) The 72nd frame of a video using a constant window size. (c) The 69th frame of a video using dynamically changed window sizes. (d) The 72nd frame of a video using dynamically changed window sizes.

3.1.2 Necessity of Corrections of Facial Feature Point Positions

When a person in a video model says “a-u,” as shown in Figure 3.2, we can find that the mouth is shrinking and that the inner upper mouth part has more and more wrinkles. Another finding is that the outer upper mouth part is brightening. One thing that deserves to be mentioned is that the skin of the inner mouth part becomes revealed, so that the points of the inner upper mouth part look as if they are moving up, as shown in Figures 3.2(a) through 3.2(d).

Due to such changing image information, including the shape, brightness, and texture, the image matching is unreliable; therefore, we must correct the positions of the feature points when the mouth of a video model takes the shapes of “a” and “o.” A wrong matching result is shown in Figure 3.2(e), from which it is seen that after connecting the points, the mouth shape appears to be an open one, although the mouth is in fact closed. After applying the proposed correction technique, the points of the inner mouth part are located at the correct positions, as shown in Figure 3.2(f).

Figure 3.2 Facial feature point tracking results for the mouth shape of a person saying “u.” (a) Tracking result of the 34th frame of a video. (b) Tracking result of the 37th frame. (c) Tracking result of the 40th frame. (d) Tracking result of the 43rd frame. (e) Result of connecting the points in the 43rd frame. (f) The 43rd frame after correction using the proposed method.


3.1.3 Tracking Process

In the proposed method, we track the facial feature points in the frames using the size-changing information of the mouth, which is acquired from the difference between the mouth size in the current frame and that in the previous frame. This changing information represents the mouth movements, so we can know the mouth states. Then, we edit the sizes of the content window and the search window and correct the positions of the feature points according to the mouth states. A flowchart of the proposed feature point tracking process is shown in Figure 3.3.

Figure 3.3 Flowchart of the proposed feature point tracking method.


3.2 Definition of Mouth States Using Mouth Size Changing Information

We propose to use the facial animation parameter units MW0 and MH0, which are the width and the height of a mouth, to represent mouth movements, as shown in Figure 3.4. First, we define some mouth states to indicate how the mouth moves. We only care about the frames in which the mouth size is different from that in the previous frame; these frames are called changed frames.

The width difference wDiff of the mouth in the current frame from that in the previous frame, and the height difference hDiff of the mouth between the two frames, are used to represent the changed size of the mouth. The two states we define for use in the proposed technique are the opening state and the closing state, which are described in the following.

Figure 3.4 The FAPUs (MW0 and MH0) used in the proposed system.

3.2.1 Mouth States

The opening state represents that a mouth is opening. The criteria for judging an opening state are that hDiff of the current changed frame is larger than zero, and that hDiff of the previous changed frame or wDiff of the current frame is larger than zero.

The closing state represents that a mouth is closing. The criteria for judging a closing state are that wDiff is smaller than zero in both the current changed frame and the previous changed frame, or that hDiff is smaller than zero in both frames.

According to these criteria, we can assign a state to every frame. A line chart illustrating this is shown in Figure 3.5, where the 32nd through 46th frames are assigned the closing state.

For example, if the 32nd frame is the currently processed frame and we compare the wDiff values of the 31st and the 32nd frames, then according to the previously mentioned criteria the 32nd frame is assigned the closing state.

Figure 3.5 A line chart of the frames in the closing state, from the 32nd through the 46th frames of the video model.

3.2.2 Detection of Mouth States

We compare wDiff and hDiff of the current frame with those of the last changed frame, which are denoted as pre_wDiff and pre_hDiff. In other words, wDiff, hDiff, pre_wDiff, and pre_hDiff together constitute the mouth size changing information.

Based on the previously mentioned criteria, the details of the proposed technique for mouth state detection are described in the following algorithm.


Algorithm 3.1. Detecting the mouth states using mouth size changing information.

Input: A video model Vmodel and locations Lfp of the feature points of the first frame of Vmodel.

Output: The mouth states S of every frame.

Steps:

1. For every frame Fcurrent of Vmodel, perform the following steps with the initial value of S set to none.

1.1 For points 8.4, 8.9, 8.3, and 8.2, apply an image matching technique to extract their corresponding points of Fcurrent using Lfp, and then update Lfp according to the locations of these extracted points of Fcurrent.

1.2 Compute MW0 and MH0 of Fcurrent in the following way:

MW0 = 8.3.x – 8.4.x;

MH0 = 8.2.y – 8.9.y.

Then, denote MW0 and MH0 of Fprevious as MW0′ and MH0′.

1.3 Calculate the difference of the mouth size between frames Fprevious and Fcurrent in the following way:

wDiff = MW0 − MW0′;

hDiff = MH0 − MH0′.

2. Assign a mouth state to S by comparing wDiff, hDiff, pre_wDiff, and pre_hDiff in the following way:

if wDiff = 0 and hDiff = 0, then S is unchanged;

if wDiff > 0 and hDiff > 0, then set S = Opening state;

if hDiff > 0 and pre_hDiff > 0, then set S = Opening state;

if wDiff < 0 and pre_wDiff < 0, then set S = Closing state;

if hDiff < 0 and pre_hDiff < 0, then set S = Closing state.

3. Update pre_wDiff and pre_hDiff with wDiff and hDiff if both of wDiff and hDiff are not equal to 0.

For example, if wDiff and pre_wDiff are both larger than zero, it means that the mouth is opening horizontally.
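The state assignment of Algorithm 3.1 can be sketched as a small function, assuming MW0 and MH0 of the current and previous frames have already been measured; the function and variable names below are ours and only illustrate the listed criteria.

```python
def detect_mouth_state(mw0, mh0, prev_mw0, prev_mh0, pre_wdiff, pre_hdiff):
    """Return the mouth state of the current frame together with the updated
    (pre_wdiff, pre_hdiff) pair, following the criteria of Algorithm 3.1."""
    wdiff = mw0 - prev_mw0   # width change against the previous frame
    hdiff = mh0 - prev_mh0   # height change against the previous frame

    state = None
    if wdiff == 0 and hdiff == 0:
        state = "unchanged"
    elif (wdiff > 0 and hdiff > 0) or (hdiff > 0 and pre_hdiff > 0):
        state = "opening"
    elif (wdiff < 0 and pre_wdiff < 0) or (hdiff < 0 and pre_hdiff < 0):
        state = "closing"

    # Step 3: remember the differences when both of them are nonzero.
    if wdiff != 0 and hdiff != 0:
        pre_wdiff, pre_hdiff = wdiff, hdiff
    return state, pre_wdiff, pre_hdiff
```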

3.3 Image Matching Using Correlation Coefficients with Dynamically Changed Window Sizes

The details of using dynamically changed window sizes are described in this section. The origin P of the content window is set at the center of the window, and the origin of the search window is at the top-left corner. The distances from P to the four borders of the content window are taken to be [Sstart, Send, Tstart, Tend], as shown in Figure 3.6.

The content window moves around inside the search window of image f. The range over which the content window can move is taken to be [Xstart + Xend, Ystart + Yend]. The center of the search window has the same coordinates as those of P.

We propose to edit the distance values, including Sstart, Send, Tstart, Tend, Xstart, Xend, Ystart, and Yend, to achieve the goal of changing sizes of the content window and search window.

After changing these distance values, we can use them as parameters of the image matching technique mentioned previously in Section 1.2.2. We compute a value of γ each time the content window moves by one pixel, so we have to compute γ (Xstart + Xend) × (Ystart + Yend) times in one session of content search. And Equation (1.1) can be rewritten as follows:

γ(x, y) = Σs Σt [w(s, t) − w̄][f(x + s, y + t) − f̄(x, y)] / {Σs Σt [w(s, t) − w̄]² × Σs Σt [f(x + s, y + t) − f̄(x, y)]²}^1/2,

where s runs from −Sstart to Send, t runs from −Tstart to Tend, w̄ is the mean value of the content window, and f̄(x, y) is the mean value of f over the area covered by the content window when its origin P is placed at (x, y).
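As a rough illustration of this matching step, the sketch below scores every placement of the content window inside the search window with the correlation coefficient and returns the offset of the best match; it assumes grayscale NumPy arrays cropped as in the earlier window sketch and is not the original implementation.

```python
import numpy as np

def match_in_search_window(content, search):
    """Slide the content window over the search window and return the
    (dy, dx) offset with the largest correlation coefficient."""
    ch, cw = content.shape
    sh, sw = search.shape
    w = content - content.mean()
    best_score, best_pos = -np.inf, (0, 0)
    for dy in range(sh - ch + 1):          # one placement per pixel of vertical moving range
        for dx in range(sw - cw + 1):      # one placement per pixel of horizontal moving range
            region = search[dy:dy + ch, dx:dx + cw]
            r = region - region.mean()
            denom = np.sqrt((w * w).sum() * (r * r).sum())
            if denom == 0:
                continue
            score = (w * r).sum() / denom  # the correlation coefficient gamma
            if score > best_score:
                best_score, best_pos = score, (dy, dx)
    return best_pos, best_score
```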


Figure 3.6 The mechanics of image matching using dynamically changed window sizes.

3.3.1 Initial Search Window Size and Content Window Size

The resolution of our video models is 640×480. The initial content window size is set to 35×35, and the initial search window size is set to 41×41. An illustration of the initial windows is shown in Figure 3.7. In addition, we define two variables addX and addY from the points 2.2 and 8.1 of the first frame of the video model, which can be added to or assigned to the distance values. More specifically, we assign the initial distance values and the values of addX and addY in the following way; a small sketch collecting these values is given after Figure 3.7:

(1) Windowsearch = the width of the search window;

(2) Windowcontent = the width of the content window;

(3) Sstart, Send, Tstart, and Tend = (Windowcontent − 1) / 2;

(4) Xstart, Xend, Ystart, and Yend = (Windowsearch − 1) / 2;

(5) addX = Upper lip H = 2.2.y − 8.1.y;

(6) addY = addX × 2.

And we specify the initial values of the distance values of the inner-mouth feature points in the following way:

(1) Tstart of point 2.2 = addY.

Figure 3.7 An illustration of the initial window sizes. (a) The initial search window size. (b) The initial content window size.
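The initial values above can be collected into a small parameter record; the sketch below assumes the values stated in this section and uses field names of our own choosing.

```python
from dataclasses import dataclass

@dataclass
class WindowParams:
    # Distances from the origin P to the four borders of the content window.
    s_start: int
    s_end: int
    t_start: int
    t_end: int
    # Distances describing how far the content window may move in the search window.
    x_start: int
    x_end: int
    y_start: int
    y_end: int

def initial_params(content_width=35, search_width=41):
    c = (content_width - 1) // 2   # item (3): 17 for a 35x35 content window
    s = (search_width - 1) // 2    # item (4): 20 for a 41x41 search window
    return WindowParams(c, c, c, c, s, s, s, s)

# Items (5) and (6): addX and addY come from points 2.2 and 8.1 of the first
# frame; p_2_2 and p_8_1 are hypothetical (x, y) tuples for those points.
# add_x = p_2_2[1] - p_8_1[1]
# add_y = add_x * 2
```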


3.3.2 Content Window Size of Opening State

In an opening state, we wish the content windows of the inner upper mouth points to contain more corner information, so we enlarge their heights. We also wish the content windows of the inner bottom mouth points to contain more lip information, so we move them toward the mouth center and move the origin P to the edge of the content windows, as shown in Figure 3.8. Because the input facial image is a neutral facial image with a closed mouth that is going to open, the initial state is set to the opening state. We specify the distance values of the inner-mouth feature points in the following way:

(1) Tstart of points 2.7 and 2.6 = addY;

(2) Sstart of point 2.9 = 1;

(3) Send of point 2.9 = Windowcontent − 1;

(4) Sstart of point 2.8 = Windowcontent − 1;

(5) Send of point 2.8 = 1.

Figure 3.8 An illustration of the content window sizes in the opening state.


3.3.3 Content Window Size of Closing State

In a closing state, the desired content window size is the opposite of that for an opening state. We wish the inner-mouth content windows to contain less skin information, so we reduce the height of the content windows of the inner upper mouth part and move the content windows of the inner bottom mouth part back to their initial positions, as shown in Figure 3.9. We specify the distance values of the inner-mouth feature points in the following way; a sketch combining the opening-state and closing-state edits is given after Figure 3.9:

(1) Tstart of points 2.7 and 2.6 = addX;

(2) Sstart of point 2.9 = (Windowcontent − 1) / 2;

(3) Send of point 2.9 = (Windowcontent − 1) / 2;

(4) Sstart of point 2.8 = (Windowcontent − 1) / 2;

(5) Send of point 2.8 = (Windowcontent − 1) / 2.

Figure 3.9 An illustration of the content window sizes in the closing state.
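A compact way to express the opening-state rules of Section 3.3.2 and the closing-state rules above is a single editing function keyed by the mouth state; the dictionary of per-point WindowParams records reuses the earlier sketch and is our own illustration, not the original code.

```python
def edit_content_windows(params, state, add_x, add_y, content_width=35):
    """Edit the per-point content-window distances for the opening or closing
    state; `params` maps a point name such as "2.7" to its WindowParams."""
    half = (content_width - 1) // 2
    if state == "opening":
        # Inner upper mouth: taller windows to keep more corner information.
        for p in ("2.7", "2.6"):
            params[p].t_start = add_y
        # Inner bottom mouth: shift the windows toward the mouth center.
        params["2.9"].s_start, params["2.9"].s_end = 1, content_width - 1
        params["2.8"].s_start, params["2.8"].s_end = content_width - 1, 1
    elif state == "closing":
        # Back toward the initial, centered configuration.
        for p in ("2.7", "2.6"):
            params[p].t_start = add_x
        for p in ("2.9", "2.8"):
            params[p].s_start = params[p].s_end = half
    return params
```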

3.3.4 Balancing Feature Point Position by Changing Search Window Size

A mouth has symmetric feature points in a mouth model, but not in real-face video models. If we do not adjust the positions, the virtual-face creation process will create a virtual face with a crooked mouth, according to our experimental experience. We therefore propose an adaptive image matching technique to make the tracked feature point locations symmetric in position.

We wish the content window to move only vertically, as shown in Figure 3.10, with the vertical moving range being from P to P′′. To restrict the movement to the vertical direction, we set the distance value Xend equal to Xstart so that the width of the search window becomes equal to the width of the content window.

Figure 3.10 Illustration of balancing feature point positions by changing search window size.

First, we extract the positions of points 8.4, 8.9, 8.1, 8.10, and 8.3, as shown in Figure 3.11(a), and set Xstart of points 2.2, 2.3, and 8.2 to 8.1.x, as shown in Figure 3.11(b). Second, we set Xstart of the other points in the following way, as shown in Figures 3.11(c) through 3.11(e); a sketch of these assignments is given after Figure 3.11.

(1) Set Xstart of points 2.7, 2.9, and 8.8 = Average (8.4.x, 8.1.x);

(2) Set Xstart of points 2.6, 2.8, and 8.7 = Average (8.1.x, 8.3.x);


(3) Set Xstart of point 8.6 = Average (8.4.x, 8.9.x);

(4) Set Xstart of point 8.5 = Average (8.10.x, 8.3.x);

(5) Set Xstart of point P84_88 = 8.4.x + 0.25×Length (8.4.x, 8.2.x);

(6) Set Xstart of point P88_82 = 8.4.x + 0.75×Length (8.4.x, 8.2.x);

(7) Set Xstart of point P82_87 = 8.2.x + 0.25×Length (8.2.x, 8.3.x);

(8) Set Xstart of point P87_83 = 8.2.x + 0.75×Length (8.2.x, 8.3.x).

Figure 3.11 An illustration of setting the values of Xstart and Xend ((a) through (e)).
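The assignments above can be sketched as a small routine that returns the pinned horizontal position of every balanced point; here Xstart is interpreted as the column on which the narrowed search window is centered, with Xend set equal to it elsewhere, which is our reading of the text rather than a quotation of the original code.

```python
def balanced_x_positions(pts):
    """Return the pinned x-position for each mouth point so that the tracked
    points stay horizontally symmetric; `pts` maps a point name such as
    "8.1" to its (x, y) location in the current frame."""
    def avg(a, b):
        return (pts[a][0] + pts[b][0]) / 2.0

    def length(a, b):
        return abs(pts[b][0] - pts[a][0])

    x = {}
    for p in ("2.2", "2.3", "8.2"):
        x[p] = pts["8.1"][0]                 # pinned to the mouth middle column
    for p in ("2.7", "2.9", "8.8"):
        x[p] = avg("8.4", "8.1")
    for p in ("2.6", "2.8", "8.7"):
        x[p] = avg("8.1", "8.3")
    x["8.6"] = avg("8.4", "8.9")
    x["8.5"] = avg("8.10", "8.3")
    x["P84_88"] = pts["8.4"][0] + 0.25 * length("8.4", "8.2")
    x["P88_82"] = pts["8.4"][0] + 0.75 * length("8.4", "8.2")
    x["P82_87"] = pts["8.2"][0] + 0.25 * length("8.2", "8.3")
    x["P87_83"] = pts["8.2"][0] + 0.75 * length("8.2", "8.3")
    return x
```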

3.4 Detection of Closed-Mouth Shapes

We detect closed-mouth shapes in order to correct the feature point positions. Although the mouth seems unchanged while it is opening, its shapes in the frames are in fact different from one another, as shown in Figures 3.12(b) through 3.12(d). When a mouth is nearly closed, point 2.7 is close to point 2.9, point 2.6 is close to point 2.8, and so are points 2.2 and 2.3, as shown in Figure 3.12(a). At this time, the feature point positions need to be corrected.

In this study, we define three types of closed-mouth shapes, which will be described in Sections 3.4.1, 3.4.2, and 3.4.3.

Figure 3.12 An example of closed-mouth shapes ((a) through (d)); the mouth is opening.

After defining the types of closed-mouth shapes, the next step is to check if correction of feature point positions needs to be done or not. We make the decision for this according to whether frames have closed-mouth shapes or not. The detailed method for detecting closed-mouth shapes is described in the following algorithm.

Algorithm 3.2. Detection of closed-mouth shapes.

Input: A frame F of a video model Vmodel.

Output: A Boolean set Smouth = {S1, S2, S3} for the frame F, with Si describing the type of the detected mouth shape.

Steps:

1. Compute the heights h1, h2, and h3 of the inner mouth by:

h1 = abs (2.7.y − 2.9.y);

h2 = abs (2.6.y − 2.8.y);

h3 = abs (2.2.y − 2.3.y).

2. For i = 1, 2, and 3, set Si to 1 if hi is smaller than one, and to 0 otherwise. If Si is labeled 0, it represents that the mouth does not have the corresponding type of closed-mouth shape.
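The three checks of Algorithm 3.2 reduce to a few absolute differences; below is a minimal sketch, assuming the tracked feature points are available as (x, y) pairs keyed by the point names used in the text.

```python
def detect_closed_mouth_shapes(pts, threshold=1.0):
    """Return the Boolean set {S1, S2, S3}; Si is True when the i-th type of
    closed-mouth shape defined in Sections 3.4.1 through 3.4.3 is present."""
    h1 = abs(pts["2.7"][1] - pts["2.9"][1])   # left inner-mouth gap
    h2 = abs(pts["2.6"][1] - pts["2.8"][1])   # right inner-mouth gap
    h3 = abs(pts["2.2"][1] - pts["2.3"][1])   # middle inner-mouth gap
    return {
        "S1": h1 < threshold,   # type-1 closed-mouth shape
        "S2": h2 < threshold,   # type-2 closed-mouth shape
        "S3": h3 < threshold,   # type-3 closed-mouth shape
    }
```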

3.4.1 Type-1 Closed-Mouth Shape

When the distance between 2.7.y and 2.9.y is smaller than one, we call this mouth shape a type-1 closed-mouth shape, as illustrated in Figure 3.13.

Figure 3.13 Diagrams of the type-1 closed-mouth shape. (a) The left points of the inner mouth (2.7 and 2.9). (b) An example of a type-1 closed-mouth shape.

3.4.2 Type-2 Closed-Mouth Shape

When the distance between 2.6.y and 2.8.y is smaller than one, we call this mouth shape a type-2 closed-mouth shape, as illustrated in Figure 3.14.

Figure 3.14 Diagrams of the type-2 closed-mouth shape. (a) The right points of the inner mouth (2.6 and 2.8). (b) An example of a type-2 closed-mouth shape.

3.4.3 Type-3 Closed-Mouth Shape

When the distance between 2.2.y and 2.3.y is smaller than one, we call this mouth shape a type-3 closed-mouth shape, as illustrated in Figure 3.15.

Figure 3.15 Diagrams of the type-3 closed-mouth shape. (a) The middle points of the inner mouth (2.2 and 2.3). (b) An example of a type-3 closed-mouth shape.

3.5 Correction of Feature Point Locations of Closed Mouth

Before correction of the locations of the feature points of the closed mouth, we describe the idea of such correction in the green channel in Section 3.5.1. Then we describe how we extract mouth information by edge detection and bi-level thresholding in the green channel in Section 3.5.2. Finally, the proposed correction process is described in Section 3.5.3.

3.5.1 Idea of Correction in Green Channel

Because the green values of the pixels of a mouth are much smaller than those of the facial skin, as shown in Figure 3.16, it is easy to distinguish the mouth from the facial skin and the teeth. We therefore propose using the green channel to extract the mouth information.

Figure 3.16 The RGB channel images of part of the 15th frame of a video model. (a) Red-channel image. (b) Green-channel image. (c) Blue-channel image.

3.5.2 Edge Detection and Bi-level Thresholding in Green Channel

The proposed system performs edge detection to check if the mouth has a closed-mouth shape, as described in the following algorithm.

Algorithm 3.3. Edge detection by applying the Sobel operator and bi-level thresholding in the green channel.

Input: A frame F of a video model Vmodel and a threshold value t for edge value thresholding.

Output: A binary image B.

Steps:

1. Take the green-channel image G of F and let G(x, y) denote the green value at pixel (x, y).

2. Detect edges in G by applying the Sobel operators shown in Figure 3.17 to implement Equation (3.5) below, yielding an edge image Bedge:

Bedge(x, y) = |Gx(x, y)| + |Gy(x, y)|,   (3.5)

where Gx(x, y) and Gy(x, y) are the responses of the horizontal and vertical Sobel operators at pixel (x, y) of G.

3. Threshold Bedge with t as the threshold value to get a binary image B(x, y) by the following equation:

B(x, y) = 1 if Bedge(x, y) ≥ t, and B(x, y) = 0 otherwise.

After the execution of the above algorithm, pixels of B(x, y) labeled 1 correspond to edge pixels.

-1  -2  -1      -1   0   1
 0   0   0      -2   0   2
 1   2   1      -1   0   1

Figure 3.17 The two Sobel operators used for edge detection.
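A sketch of Algorithm 3.3 that applies the two Sobel masks of Figure 3.17 to the green channel and thresholds the result; the convolution is written out directly so the sketch depends only on NumPy, and the assumed frame layout (an H x W x 3 array in RGB order) is our own choice.

```python
import numpy as np

SOBEL_VERTICAL = np.array([[-1, -2, -1],
                           [ 0,  0,  0],
                           [ 1,  2,  1]], dtype=float)
SOBEL_HORIZONTAL = np.array([[-1, 0, 1],
                             [-2, 0, 2],
                             [-1, 0, 1]], dtype=float)

def edge_binary_green(frame_rgb, t):
    """Detect edges in the green channel with the Sobel operators and
    threshold the edge magnitude with t, returning a 0/1 image B."""
    g = frame_rgb[:, :, 1].astype(float)          # green-channel image G
    h, w = g.shape
    edge = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = g[y - 1:y + 2, x - 1:x + 2]
            gx = (patch * SOBEL_HORIZONTAL).sum()
            gy = (patch * SOBEL_VERTICAL).sum()
            edge[y, x] = abs(gx) + abs(gy)        # Equation (3.5)
    return (edge >= t).astype(np.uint8)           # pixels labeled 1 are edges
```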


3.5.3 Correction Process

The final step in feature point tracking is to correct the feature points in frames which have closed-mouth shapes. The details of the correction process are described in
