
Chapter 1 Introduction

1.5  Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 gives an overview of the proposed technique for virtual face creation. Chapter 3 presents the proposed technique for tracking facial feature points automatically. Chapter 4 describes the proposed technique for creating virtual faces with dynamic mouth movements. Chapter 5 presents some experimental results and discussions.

Finally, conclusions and suggestions for future work are given in Chapter 6.


Chapter 2

Overview of Proposed Method for Virtual Face Creation

2.1 Idea of Proposed Method

The virtual face creation system proposed in this study works like a black box: the input is a single facial image and the output is a facial image sequence. In other words, the system makes the human face in a single image appear to laugh or talk in an artificially-created image sequence or video. We use real-face video models to achieve this goal, so the face in the input image performs the same mouth movements as the models do. The system is described in more detail in the following.

First, we propose a technique to analyze the video models to obtain mouth movement information. Because some mouth movements produce quite different mouth shapes from others, such as those of “u” and “o,” it is not easy to conduct image matching for such mouth movements. Image matching has another problem, which occurs when a closed mouth is opening or when an opened mouth is closing: the teeth appear or disappear alternately and interfere with the correctness of image matching. So we propose a novel image matching technique to deal with the interference caused by changing mouth shapes and teeth appearances.

A virtual mouth created by a morphing technique may be bigger or smaller than the mouth of the input image, which is closed. For example, the virtual mouth will be bigger than the mouth of the input image when the person in the video model says the letter “a,” for which the mouth opens and the chin moves down. If we simply paste the mouths onto the input image, the resulting image will have visible edges at the pasted mouth boundary. So we propose a technique to extract the mouth region and smooth its edges before integrating the mouth with the input image.
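To make the smoothing idea concrete, the following sketch feathers a binary mouth mask and alpha-blends the extracted mouth onto the input image so that no hard edge remains at the pasted boundary. It is only a simplified illustration under assumed data representations, not the exact implementation of this study, and the function and parameter names are hypothetical.

import cv2
import numpy as np

def paste_mouth_smoothly(input_face, virtual_mouth, mouth_mask, feather_px=7):
    # input_face    : H x W x 3 uint8 image containing the original (closed) mouth.
    # virtual_mouth : H x W x 3 uint8 image containing the morphed mouth, aligned with input_face.
    # mouth_mask    : H x W uint8 mask, 255 inside the extracted mouth region.
    # feather_px    : half-width of the smoothed transition band in pixels.
    # Feather the mask so the transition from mouth to surrounding skin is
    # gradual instead of a hard edge at the pasted boundary.
    ksize = 2 * feather_px + 1
    alpha = cv2.GaussianBlur(mouth_mask.astype(np.float32) / 255.0, (ksize, ksize), 0)
    alpha = alpha[..., None]  # broadcast over the color channels
    blended = alpha * virtual_mouth.astype(np.float32) + (1.0 - alpha) * input_face.astype(np.float32)
    return blended.astype(np.uint8)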

In this chapter, the techniques proposed to achieve the goals mentioned above are described. First, a review of the face model of Chen and Tsai [1], which was adapted from [16], is given in Section 2.2. Construction of a mouth model and the uses of mouth features based on the adapted face model are described in Section 2.3. Finally, the proposed process for creating virtual faces from sequential images is given in Section 2.4. More detailed descriptions of the steps involved in these techniques are given in Chapters 3 and 4.

2.2 Review of Adopted Face Model

Chen and Tsai [1] proposed a method to generate cartoon faces automatically from neutral facial images. Before cartoon face generation, a face model with facial feature points was defined first. Ostermann [16] specified the 84 feature points and the facial animation parameter units (FAPUs) of the face model used in the MPEG-4 standard, as shown in Figures 2.1 and 2.2. However, this face model is not suitable for cartoon face drawing. Accordingly, Chen and Tsai [1] defined an adapted face model with 72 feature points by adding or eliminating some feature points of the face model used in the MPEG-4 standard. Also, some FAPUs were specified according to the MPEG-4 standard. An illustration of the adapted face model is shown in Figure 2.3.


Figure 2.1 84 feature points in MPEG-4.

Figure 2.2 FAPUs in MPEG-4.


Chen and Tsai [1] assigned some feature points as control points to control the facial expressions of the cartoon face. These control points are also called face model control points in this study and are listed as follows; an illustrative summary of them in code form is given after the list.

1. Eyebrow control points: there are 8 control points in both eyebrows, namely, 4.2, 4.4, 4.4a, 4.6, 4.1, 4.3, 4.3a, and 4.5.

2. Eye control points: there are 4 control points in the eyes, namely, 3.1, 3.3, 3.2, and 3.4.

3. Mouth control points: there are 4 control points in the mouth, namely, 8.9, 8.4, 8.3, and 8.2, by which other mouth feature points are computed.

4. Jaw control point: there is one control point on the jaw, namely, 2.1, which is computed automatically from the position of control point 8.2 and the value of the facial animation parameter JawH.
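The control points listed above can be gathered into a simple lookup table for later reference. The following lines are only an illustrative sketch of such a data structure, not code from Chen and Tsai [1].

# Face model control points of the adapted face model (illustrative summary).
FACE_MODEL_CONTROL_POINTS = {
    "eyebrows": ["4.2", "4.4", "4.4a", "4.6", "4.1", "4.3", "4.3a", "4.5"],
    "eyes":     ["3.1", "3.3", "3.2", "3.4"],
    "mouth":    ["8.9", "8.4", "8.3", "8.2"],  # other mouth feature points are computed from these
    "jaw":      ["2.1"],                       # computed from point 8.2 and the FAPU JawH
}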


Figure 2.3 The adapted face model of Chen and Tsai [1]. (a) The proposed 72 feature points. (b) The proposed FAPUs.


2.3 Construction of Mouth Model and Uses of Mouth Features

Construction of the mouth model based on the above-mentioned adapted face model is introduced in Section 2.3.1. Uses of mouth feature regions are illustrated in Section 2.3.2, and uses of mouth control points are described in Section 2.3.3.

2.3.1 Construction of Mouth Model Based on Adapted Face Model

The use of mouth feature points helps us to create virtual faces. It also helps us to find the mouth feature regions, which are defined by groups of mouth feature points. In addition, the use of mouth feature points compresses the large volume of image data into a small set of meaningful points. Before locating the positions of the feature points, we must define a model that makes the feature points meaningful.

In this study, we propose a mouth model based on the face model in the MPEG-4 standard and the adapted face model used in Chen and Tsai [1], again by adding and eliminating some feature points. The inner mouth feature points, including 2.7, 2.2, 2.6, 2.9, and 2.8, are used in the proposed mouth model. We also define some additional points to make the bottom lip smoother, namely P84_88, P88_82, P82_87, and P87_83, shown as orange dots in Figure 2.4. Furthermore, in the proposed mouth model, we add further points to assist morphing, which are marked as blue dots in the entire proposed model shown in Figure 2.5. A summary of the named points in code form is given below.
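The named points of the proposed mouth model can be summarized as follows. This is only an illustrative listing; the additional morphing-aid points of Figure 2.5 are omitted because they are not named individually in the text.

# Feature points used in the proposed mouth model (illustrative summary).
INNER_MOUTH_POINTS = ["2.7", "2.2", "2.6", "2.9", "2.8"]

# Additional points defined in this study to make the bottom lip smoother
# (the orange dots in Figure 2.4).
BOTTOM_LIP_SMOOTHING_POINTS = ["P84_88", "P88_82", "P82_87", "P87_83"]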


Figure 2.4 Mouth feature points used in the proposed method.


Figure 2.5 Entire proposed mouth model. The blue dots in the model are added to help morphing, and the red dots are control points.

2.3.2 Mouth Feature Regions

Mouth feature regions provide feature information such as position, size, and range. An example of the bottom part of a virtual face is shown in Figure 2.6(a), which was created by using a photo of Angelina Jolie as the input image. As shown in Figure 2.6(b), the mouth region to be pasted on the input image is composed of the skin region, the lip region, and the teeth region, shown in Figures 2.6(c) through 2.6(e). Mouth movements affect the range of the skin region, the size of the lip region, and the size of the teeth region. The teeth information is obtained from the video model.
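As a small illustration of this composition, assuming each region is represented by a boolean mask of the same size as the frame (an assumed representation with hypothetical names), the mouth region is simply the union of the three feature-region masks of Figure 2.6:

import numpy as np

def compose_mouth_region(skin_mask, lip_mask, teeth_mask):
    # Union of the skin, lip, and teeth region masks (Figures 2.6(c) through 2.6(e))
    # gives the mouth region to be pasted onto the input image (Figure 2.6(b)).
    return np.logical_or(np.logical_or(skin_mask, lip_mask), teeth_mask)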


Figure 2.6 Mouth feature regions used in the proposed method. (a) Bottom part of a virtual face. (b) The mouth region. (c) The skin region outside the mouth. (d) The lip region. (e) The teeth region.

2.3.3 Mouth Control Points

Some feature points are treated as control points, which determine the size of a mouth and the range of a mouth region. By controlling the positions of these control points, a virtual face acquires mouth movements and appears to be talking. In this study, we propose a technique to reassign the positions of these points, shown as red dots in Figure 2.5, to achieve this goal.

2.4 Virtual-Face Creation Process from Sequential Images

Figure 2.7 illustrates a flowchart of the stages of the proposed virtual face creation process from sequential images. First, a neutral facial image and the first frame of a real-face video model are used as inputs to a feature point locator. After the work of feature point location is accomplished by the locator, the remaining frames of the video model and the feature points of the first frame are used as inputs to a feature point tracker.

Then, the feature point tracker tries to extract the feature points in the remaining frames of the video model. The problems mentioned in Section 2.1 are found to happen often when a closed mouth is opening or when an opened mouth is closing.

So we propose to detect the state of the mouth, which is one of the two above-mentioned states, the opening state and the closing state, or an unchanged state meaning that the mouth size in the current frame is the same as that in the previous frame. Then, we use the mouth state information to change the matching area dynamically to reduce incorrect matching results. This area changing technique is called window size editing in the following.

When an opened mouth shrinks gradually into a closed mouth, the positions of the feature points of the inner upper mouth part sometimes become different from those of the inner bottom mouth part. So we propose a technique to detect closed-mouth shapes and move the feature points of the inner mouth part to the correct positions we want.

We also propose a technique to track feature points in a frame according to the image information in the previous frame. If the feature points in the previous frame are located at wrong positions, the tracker will track wrong points in the remaining frames of the video model. Feature point correction is thus necessary to make sure that the positions of the feature points are all correct; otherwise, feature point tracking will fail, according to our experimental experience.

The virtual face creator we propose then divides and morphs the mouth shapes to get the bottom part of every virtual face. The final step is to extract the mouth region from each virtual face and integrate it with the input image. This completes the proposed process for virtual face creation.

[Figure 2.7 contains the following components: Camera, Image Sequences, Single Facial Images, Real-Face Video Models, Feature Point Locator, Feature Point Tracker (Mouth State Detecting, Window Size Editing, Image Matching, Closed-Mouth Shape Detecting, Facial Feature Point Correcting), and Virtual Face Creator (Mouth Shape Division, Mouth Shape Morphing, Extraction of Mouth Region).]

Figure 2.7 Stages of proposed virtual face creation from sequential images.


Chapter 3

Tracking of Facial Feature Points

3.1 Idea of Proposed Techniques

As mentioned in Section 1.2.2, during the tracking of facial feature points, suppose that a subimage w at coordinates (s, t) within an image f is processed. The moving range of w inside f is then taken to be [2s+1, 2t+1] in this study. The region covered by w is called a content window, and the moving range is called a search window. We propose in this study an image matching technique that uses the mouth movement information to change the sizes of the content window and the search window.

Applying this technique, we can solve the interference problem of changing mouth shapes and teeth appearances mentioned in Section 2.1.

In this chapter, the necessity of changing the content and search window sizes and the necessity of correcting facial feature point positions are explained in Sections 3.1.1 and 3.1.2, respectively. Finally, the proposed method for tracking facial feature points is described in Section 3.1.3.

3.1.1 Necessity of Changes of Window Sizes

Because the mouth shapes are not all the same during a human’s talking process, the content window sometimes includes insufficient or too much information for image matching. Two other reasons for using a different window size for each feature point are that the teeth interfere with the matching process in the tracking of some feature points and that the movement ranges of the feature points are different. So a window size adaptation technique is proposed.

Examples of using changed and unchanged window sizes are shown in Figure 3.1: Figures 3.1(a) and 3.1(b) are results of applying a constant window size, and Figures 3.1(c) and 3.1(d) are those of applying dynamically changed window sizes. We can see that with the former scheme the points are tracked erroneously and stay at the same position, as shown in Figure 3.1(b), while with the latter scheme the points are tracked correctly and lie at the edge of the mouth, as shown in Figure 3.1(d).

Figure 3.1 Examples of changed and unchanged window sizes. (a) The 69th frame of a video using a constant window size. (b) The 72nd frame of a video using a constant window size. (c) The 69th frame of a video using dynamically changed window sizes. (d) The 72nd frame of a video using dynamically changed window sizes.

3.1.2 Necessity of Corrections of Facial Feature Point Positions

When the person in a video model says “a-u,” as shown in Figure 3.2, we can find that the mouth is shrinking and the inner upper mouth part has more and more wrinkles. Another finding is that the outer upper mouth part is brightening. One thing that deserves to be mentioned is that the skin of the inner mouth part is revealed, so that the points of the inner upper mouth part look like they are moving up, as shown in Figures 3.2(a) through 3.2(d).

Due to such changing image information, including the shape, brightness, and texture, image matching is unreliable; therefore, we must correct the positions of the feature points when the mouth of a video model has the shapes of “a” and “o.” A wrong matching result is shown in Figure 3.2(e), from which it is seen that after connecting the points, the mouth shape becomes an opened one, although it is in fact a closed mouth. After applying the proposed correction technique, the points of the inner mouth part are located at the correct positions, as shown in Figure 3.2(f).


Figure 3.2 Facial feature point tracking results for the mouth shape of a person saying “u.” (a) Tracking result of the 34th frame of a video. (b) Tracking result of the 37th frame. (c) Tracking result of the 40th frame. (d) Tracking result of the 43rd frame. (e) Result of connecting the points in the 43rd frame. (f) The 43rd frame after correction using the proposed method.


3.1.3 Tracking Process

In the proposed method, we track the facial feature points in the frames using the size changing information of the mouth, which is acquired from the difference between the size of the mouth in the current frame and that in the previous frame. This changing information represents the mouth movements, so we can know the mouth states. Then, we edit the sizes of the content window and the search window, and correct the positions of the feature points, according to the mouth states. A flowchart of the proposed feature point tracking process is shown in Figure 3.3.

Figure 3.3 Flowchart of the proposed feature point tracking method.


3.2 Definition of Mouth States Using Mouth Size Changing Information

We propose to use the facial animation parameter units MW0 and MH0, which are the width and the height of the mouth, to represent mouth movements, as shown in Figure 3.4. First, we define some mouth states to indicate how the mouth moves. We only care about the frames in which the size of the mouth is different from that of the previous frame; these frames are called changed frames.

The width difference wDiff between the mouth of the current frame and that of the previous frame, and the height difference hDiff between the mouths of the two frames, are used to represent the changed size of the mouth. The two states defined for use in the proposed technique, the opening state and the closing state, are described in the following.

Figure 3.4 The FAPUs MW0 and MH0 used in the proposed system.

3.2.1 Mouth States

The opening state represents that a mouth is opening. The criteria for judging an opening state are that hDiff of the current changed frame is larger than zero, and that hDiff of the previous changed frame or wDiff of the current frame is larger than zero.

The closing state represents that a mouth is closing. The criteria for judging a closing state are that wDiff is smaller than zero in both the current changed frame and the previous changed frame, or that hDiff is smaller than zero in both of these frames.

According to these criteria, we can assign a state to every frame. A line chart illustrating this is shown in Figure 3.5, where the 32nd through 46th frames are assigned the closing state.

For example, if the 32nd frame is the currently-processed frame and we compare the values of wDiff of the 31st frame and the 32nd one, then according to the previously-mentioned criteria the 32nd frame is assigned the closing state.

Figure 3.5 A line chart of the frames in the closing state, from the 32nd through the 46th frames of the video model.

3.2.2 Detection of Mouth States

We compare wDiff and hDiff of the current frame with those of the last changed frame, which are denoted as pre_wDiff and pre_hDiff. In other words, wDiff, hDiff, pre_wDiff, and pre_hDiff together form the mouth size changing information.

Based on the previously-mentioned criteria, the details of the proposed technique for mouth state detection are described in the following algorithm.


Algorithm 3.1. Detecting the mouth states using mouth size changing information.

Input: A video model Vmodel and locations Lfp of the feature points of the first frame of Vmodel.

Output: The mouth state S of every frame.

Steps:

1. For every frame Fcurrent of Vmodel, perform the following steps with the initial value of S set to none.

1.1 For points 8.4, 8.9, 8.3, and 8.2, apply an image matching technique to extract their corresponding points of Fcurrent using Lfp, and then update Lfp according to the locations of these extracted points of Fcurrent.

1.2 Compute MW0 and MH0 of Fcurrent in the following way:

MW0 = 8.3.x – 8.4.x;

MH0 = 8.2.y – 8.9.y.

Then, denote MW0 and MH0 of Fprevious as MW0′ and MH0′.

1.3 Calculate the difference of the mouth size between frames Fprevious and Fcurrent in the following way:

wDiff = MW0 − MW0′;

hDiff = MH0 − MH0′.

2. Assign a mouth state to S by comparing wDiff, hDiff, pre_wDiff, and pre_hDiff in the following way:

if wDiff = 0 and hDiff = 0, then set S = Unchanged state;

if wDiff > 0 and hDiff > 0, then set S = Opening state;

if hDiff > 0 and pre_hDiff > 0, then set S = Opening state;

if wDiff < 0 and pre_wDiff < 0, then set S = Closing state;

if hDiff < 0 and pre_hDiff < 0, then set S = Closing state.

3. Update pre_wDiff and pre_hDiff with wDiff and hDiff if both wDiff and hDiff are not equal to 0.

For example, if wDiff and pre_wDiff are both larger than zero, it means that the mouth is opening horizontally.
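A minimal Python sketch of the per-frame computation in Algorithm 3.1 is given below. It assumes the four mouth control points of the current and previous frames have already been extracted by image matching (Step 1.1), and the function name and data representation are hypothetical; only wDiff, hDiff, pre_wDiff, pre_hDiff, MW0, and MH0 follow the text above.

def detect_mouth_state(points, prev_points, pre_wDiff, pre_hDiff):
    # points / prev_points : dicts mapping feature point names to (x, y)
    #                        locations, e.g. points["8.3"] = (x, y).
    # pre_wDiff, pre_hDiff : size differences kept from the last changed frame.
    # Step 1.2: compute the FAPUs MW0 and MH0 of the current and previous frames.
    MW0  = points["8.3"][0] - points["8.4"][0]            # mouth width
    MH0  = points["8.2"][1] - points["8.9"][1]            # mouth height
    MW0p = prev_points["8.3"][0] - prev_points["8.4"][0]  # MW0'
    MH0p = prev_points["8.2"][1] - prev_points["8.9"][1]  # MH0'

    # Step 1.3: mouth size differences between the two frames.
    wDiff = MW0 - MW0p
    hDiff = MH0 - MH0p

    # Step 2: assign the mouth state S.
    S = None
    if wDiff == 0 and hDiff == 0:
        S = "unchanged"
    elif (wDiff > 0 and hDiff > 0) or (hDiff > 0 and pre_hDiff > 0):
        S = "opening"
    elif (wDiff < 0 and pre_wDiff < 0) or (hDiff < 0 and pre_hDiff < 0):
        S = "closing"

    # Step 3: keep wDiff and hDiff only if both are nonzero (a changed frame).
    if wDiff != 0 and hDiff != 0:
        pre_wDiff, pre_hDiff = wDiff, hDiff

    return S, pre_wDiff, pre_hDiff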

3.3 Image Matching Using Correlation Coefficients with Dynamically Changed Window Sizes

The details of using dynamically changed window sizes are described in this section. The origin P of the content window is set at the center of the window, and the origin of the search window is at its top left corner. The distances from P to the four borders of the content window are denoted by [Sstart, Send, Tstart, Tend], as shown in Figure 3.6.

The content window moves around inside the search window of image f. The range within which the content window can move is taken to be [Xstart+Xend, Ystart+Yend]. The center of the search window has the same coordinates as P.

We propose to edit the distance values, including Sstart, Send, Tstart, Tend, Xstart, Xend, Ystart, and Yend, to achieve the goal of changing the sizes of the content window and the search window.

After changing these distance values, we can use them as parameters to the image matching technique mentioned in Section 1.2.2. We compute a value of γ each time the content window moves by one pixel, so γ has to be computed (Xstart+Xend) × (Ystart+Yend) times in a session of content search, with Equation (1.1) evaluated over the current position of the content window, as illustrated by the sketch below.
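The following sketch shows how the edited distance values enter the search, assuming that Equation (1.1) is the standard correlation coefficient γ between the content window and the candidate region of f, as in Section 1.2.2. The function name and the border check are additions for illustration only.

import numpy as np

def match_with_edited_windows(f, P, Sstart, Send, Tstart, Tend,
                              Xstart, Xend, Ystart, Yend, w_prev):
    # f      : grayscale frame of the video model as a 2-D numpy array.
    # P      : (px, py), the feature point location in the previous frame; it is
    #          the center of both the content window and the search window.
    # S*/T*  : distances from P to the four borders of the content window.
    # X*/Y*  : extents of the search window around P.
    # w_prev : content window cropped from the previous frame around P with the
    #          same Sstart/Send/Tstart/Tend extents.
    px, py = P
    wzero = w_prev - w_prev.mean()
    wnorm = np.sqrt((wzero ** 2).sum())

    best_gamma, best_shift = -1.0, (0, 0)
    # Gamma is evaluated once per one-pixel move of the content window,
    # i.e. (Xstart + Xend) x (Ystart + Yend) times per search session.
    for dy in range(-Ystart, Yend):
        for dx in range(-Xstart, Xend):
            region = f[py + dy - Tstart : py + dy + Tend + 1,
                       px + dx - Sstart : px + dx + Send + 1]
            if region.shape != w_prev.shape:       # skip positions falling outside f
                continue
            rzero = region - region.mean()
            denom = wnorm * np.sqrt((rzero ** 2).sum())
            if denom == 0:
                continue
            gamma = (wzero * rzero).sum() / denom  # correlation coefficient of Eq. (1.1)
            if gamma > best_gamma:
                best_gamma, best_shift = gamma, (dx, dy)
    return best_shift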


Figure 3.6 The mechanics of image matching using dynamically changed window sizes.

3.3.1 Initial Search Window Size and Content Window Size

The resolution of our video models is 640×480. The initial content window size is set to 35×35, and the initial search window size is set to 41×41. An illustration of the initial windows is shown in Figure 3.7. In addition, we define two variables addX and addY by the points 2.2 and 8.1 of the first frame of the video model, which can be

