

Chapter 4 Automatic Learning of Facial Features

4.3 Learning of Base Regions

4.3.5 Correcting Erroneous Matching Results

Although the results of the image matching technique are stable, some errors still exist due to unavoidable changes in lighting. Variations of shadows on the face also affect the accuracy of the results.


Fig. 4.12 Flowchart of image matching with sub-pixel precision on rotated faces.

For instance, Fig. 4.13(a) shows an example of a result obtained using the method proposed in Section 4.3.4. The corresponding trajectories of the x-axis, the y-axis, and the rotated angle are shown in Fig. 4.13(b). The red parts of the trajectories mark places where the position of the base region suddenly moves in one direction and then immediately moves back. For example, the x-axis of the base region is 344.0 in frame 2, moves right to 344.5 in frame 3, and then moves back to 344.0 in frame 4.

This kind of situation is abnormal, since a face usually shakes smoothly.

To solve this problem, a method is proposed to correct the erroneous values. It deals with the following situation: the position of the base region suddenly moves in one direction and then back within three consecutive frames. In this situation, the position of the base region in the second frame is corrected to the midpoint of the base region positions in the first and third frames.

Fig. 4.13 Example of erroneous matching results. (a) The x-axis, the y-axis, and the rotated angle detected by the image matching technique. (b) The trajectories of the x-axis, the y-axis, and the rotated angle.

The entire algorithm for correcting erroneous matching results is described as follows:

Algorithm 4. Correcting erroneous matching results.

Input: a sequence of triples (x_i, y_i, A_i) = {(x_1, y_1, A_1), (x_2, y_2, A_2), …, (x_N, y_N, A_N)}, where x_i, y_i, and A_i represent the x-axis, the y-axis, and the rotated angle of the base region of frame i, respectively.

Output: a sequence of corrected triples (x_i, y_i, A_i) = {(x_1, y_1, A_1), (x_2, y_2, A_2), …, (x_N, y_N, A_N)}.

Steps:

Step 1: Set current to 2.

Step 2: Set x_current to (x_{current-1} + x_{current+1})/2, if (x_current - x_{current-1})(x_current - x_{current+1}) > 0.

Step 3: Set y_current to (y_{current-1} + y_{current+1})/2, if (y_current - y_{current-1})(y_current - y_{current+1}) > 0.

Step 4: Set A_current to (A_{current-1} + A_{current+1})/2, if (A_current - A_{current-1})(A_current - A_{current+1}) > 0.

Step 5: Increase current by 1.

Step 6: Repeat Steps 2 through 5 until current is larger than N - 1.
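The following Python sketch illustrates Algorithm 4. It assumes the matching results are available as a list of (x, y, angle) triples; the function name and the list-based representation are illustrative choices, not part of the thesis. The example values reproduce the x-jitter described in the text (344.0 -> 344.5 -> 344.0 across three frames).

def correct_matching_results(triples):
    """Smooth single-frame jitter in a list of (x, y, angle) triples.

    If a value moves in one direction and then immediately back within
    three consecutive frames, the middle value is replaced by the midpoint
    of its two neighbors, as in Algorithm 4.
    """
    values = [list(t) for t in triples]            # mutable working copy
    for cur in range(1, len(values) - 1):          # frames 2 .. N-1 (0-based)
        for k in range(3):                         # k = 0: x, 1: y, 2: angle
            prev_v = values[cur - 1][k]
            cur_v = values[cur][k]
            next_v = values[cur + 1][k]
            # "forward then back": the middle value lies outside both neighbors
            if (cur_v - prev_v) * (cur_v - next_v) > 0:
                values[cur][k] = (prev_v + next_v) / 2.0
    return [tuple(v) for v in values]

print(correct_matching_results(
    [(344.0, 280.0, 0.0), (344.5, 280.5, 0.0), (344.0, 280.5, 0.0)]))
# the middle x is corrected to (344.0 + 344.0) / 2 = 344.0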

Fig. 4.14 shows the result of Fig. 4.13(a) after correcting erroneous matching results. The trajectories shown in Fig. 4.14(b) are smoother and more natural.

Frame   X       Y       Angle
1       344.0   280.0   0
2       344.0   280.5   0
3       344.0   280.5   0
4       344.0   281.0   0
5       344.0   281.5   0

Fig. 4.14 Example of corrected matching results. (a) The values in Fig. 4.13(a) after correction. (b) The corrected trajectories of the x-axis, the y-axis, and the rotated angle.

4.4 Experimental Results

In this section, some experimental results of the proposed methods described in this chapter are shown. Fig. 4.15 shows a sequence of image frames in a recorded video. The positions and rotated angles of the base regions detected using the method proposed in Section 4.3.4 are listed in Table 4.1(a), and the values corrected using the method proposed in Section 4.3.5 are listed in Table 4.1(b).

Fig. 4.15 A sequence of image frames in a recorded video.

Table 4.1 Positions and rotated angles of base regions of frames in Fig. 4.15. (a) Uncorrected values. (b) Corrected values.

Frame X Y Angle

4.5 Summary and Discussions

In this chapter, the automation of facial feature learning was emphasized. Since the positions of these features can be calculated from the position of the base region, automatic detection of the base region becomes very important. The image matching technique can be used for this job, and the proposed methods modify the technique so that it can be performed on rotated faces with sub-pixel accuracy. A method is also proposed to correct erroneous matching results.

Chapter 5

Virtual Talking Face Animation Generation

5.1 Introduction

In Chapters 3 and 4, the audio features, the base image sequences, and the facial features are collected using the proposed methods. With the help of these features, virtual talking face animations with synchronized utterances can be generated. Fig. 5.1 shows a block diagram of the proposed animation generation process. “Proper” frames are generated according to the timing information of the input audio and the feature information in the viseme database. Finally, the input audio and the generated frames are combined to produce an animation. Obviously, generating “proper” frames is a very important task. Badly generated frames may lead to asynchrony between the audio and the frames, or to unnatural animations.

Fig. 5.1 Block diagram of proposed animation generation process.

Fig. 5.2 illustrates the frame generation process proposed by Lin and Tsai [6]. Firstly, the timing information of the syllables of an input speech is obtained. This timing information is used to decide the number of frames reserved for every syllable and for the pauses between syllables. The starting frame of every syllable is calculated by accumulating the numbers of frames of the preceding syllables and pauses. Secondly, the corresponding visemes of the syllables are prepared. Since the number of frames of a syllable in the viseme database may not equal the number of frames reserved, an algorithm for frame increase and decrease is used. Thirdly, the mouth images of the visemes are pasted onto the base images. Finally, transition frames between syllables are replaced with several intermediate frames to produce smoother animations.

Fig. 5.2 Diagram of frame generation process proposed by Lin and Tsai [6].

The above-mentioned process is effective in generating proper frames for synchronized and smooth animations. However, some problems still exist. In the following sections of this chapter, the problems are illustrated, and some methods are proposed to solve these problems in order to produce higher-quality animations.

5.2 Synchronization Between Frames And Speeches

In Lin and Tsai [6], the starting frame of every syllable is calculated by accumulating the numbers of frames of the preceding syllables and pauses. To obtain the number of frames of a syllable or pause, its length is multiplied by a frames-per-second constant, which controls the number of frames appearing within a second. For a standard NTSC video, the constant is 29.97.

Since the length of a syllable or pause is a floating-point number, the calculated number of frames is also a floating-point number. However, the number of frames can only be an integer. Discarding the fractional part leads to a small error; these errors accumulate and affect the synchronization between the audio and the frames.

To solve this problem, a method of re-synchronization on every syllable is proposed. Firstly, the starting frame of every syllable is calculated as a floating-point number, like F0 through F4 in Fig. 5.3. Then the fractional parts of these floating-point frame numbers are discarded, which results in integer starting frames, like I0 through I4. These integer starting frames are used as the starting frames of the syllables. The number of frames of a syllable is obtained by subtracting the nearest successive integer starting frames. For example, the number of frames of the first syllable (the first green part) is I1 - I0, and the number of frames of the first pause between syllables (the first blue part) is I2 - I1.

The resulting integer durations can be used to generate animations synchronized with the input audio, with an error of at most one frame.
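The re-synchronization step can be sketched as follows in Python. The function name, the segment representation (an ordered list of syllable and pause lengths in seconds), and the example lengths are illustrative assumptions; the 29.97 frames-per-second value is the NTSC constant mentioned above.

NTSC_FPS = 29.97   # frames-per-second constant for standard NTSC video

def resynchronize(segment_lengths, fps=NTSC_FPS):
    """Convert syllable/pause lengths (seconds) into integer frame counts.

    Floating-point starting frames F0, F1, ... are obtained by accumulating
    the lengths; their fractional parts are discarded to get integer starting
    frames I0, I1, ...; the frame count of a segment is the difference of
    successive integer starts, so truncation errors never exceed one frame.
    """
    float_starts = [0.0]
    for length in segment_lengths:
        float_starts.append(float_starts[-1] + length * fps)
    int_starts = [int(f) for f in float_starts]         # discard fractions
    return [int_starts[i + 1] - int_starts[i]
            for i in range(len(segment_lengths))]

# Example: syllable, pause, syllable, pause, syllable (lengths in seconds).
print(resynchronize([0.21, 0.05, 0.18, 0.07, 0.25]))    # prints [6, 1, 6, 2, 7]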

5.3 Frame Generation by Interpolation

In the following two sections, two frame generation processes for two applications are proposed. In Section 5.3.1, a frame generation process suitable for creating virtual talking faces is proposed. In Section 5.3.2, a frame generation process suitable for creating virtual singing faces is described.

Fig. 5.3 Diagram of re-synchronization on every syllable.

5.3.1 Speaking Case

In Section 5.2, the number of frames for every syllable in an input audio is determined. This number cannot be altered, or the audio and the image frames will become asynchronous. However, the number of frames of the same syllable in the viseme database may not equal the desired number due to different speaking speeds, so the stored frames must be adapted.

In [6], Lin and Tsai proposed an algorithm for frame increase and decrease to complete this job. The algorithm inserts frames at positions where adjacent frames are most unlike each other, and deletes frames at positions where adjacent frames are most alike, until the number of frames equals the desired one. However, the results produced by this algorithm are not reasonable. It is noticed that when a person speaks faster, the shape of his/her mouth changes more abruptly; on the contrary, when a person speaks more slowly, the shape of his/her mouth changes more gradually. However, the motion of the mouth remains the same when speaking an identical syllable.

To simulate the motion mentioned above, the idea of interpolation is used. As shown in Fig. 5.4, the original frames are divided into N parts, where N is the number of desired frames. Then, a frame in each part is selected to represent that part. Finally, the contents of the desired frames are replaced with the contents of the representative frames, one by one.

Fig. 5.4 Idea of frame interpolation. (a) The number of original frames is larger than that of the desired frames. (b) The number of original frames is smaller than that of the desired frames.
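A minimal Python sketch of the frame interpolation idea follows. It assumes the stored viseme frames are given as a list; since the text does not specify which frame represents a part, taking the frame at the centre of each part is an illustrative choice.

def resample_frames(original_frames, n_desired):
    """Select n_desired representative frames from original_frames.

    The original frames are divided into n_desired equal parts, and the
    frame at the centre of each part represents that part.  This works both
    when the original sequence is longer than the desired one (frames are
    dropped) and when it is shorter (frames are repeated).
    """
    n_original = len(original_frames)
    part = n_original / n_desired                    # length of one part
    picked = []
    for i in range(n_desired):
        index = int((i + 0.5) * part)                # centre of the i-th part
        picked.append(original_frames[min(index, n_original - 1)])
    return picked

frames = list("ABCDEFG")                             # 7 stored viseme frames
print(resample_frames(frames, 4))                    # ['A', 'C', 'E', 'G']
print(resample_frames(frames, 10))                   # some frames repeated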

The process of frame generation for the speaking case is illustrated in Fig. 5.5. Firstly, the visemes, namely the mouth images, of the syllables are determined using the frame interpolation technique. Secondly, the visemes of the pauses between syllables need to be decided. When the duration of a pause is long, it is assumed that the person closes his/her mouth; otherwise, the person keeps his/her mouth open and unchanged, just as if the pause did not exist. Thirdly, the visemes of the first and the last pause should be a closed mouth, because the person has not started speaking during the first pause and closes his/her mouth after the last syllable. Finally, the determined visemes are integrated into the base images.

Fig. 5.5 Frame generation of speaking cases.
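The pause handling described above can be sketched as follows. The closed-mouth placeholder and the 0.5-second threshold for a "long" pause are assumptions, since the thesis does not give a concrete threshold.

CLOSED_MOUTH = "closed_mouth_viseme"   # placeholder for the closed-mouth viseme
LONG_PAUSE_SECONDS = 0.5               # assumed threshold for a "long" pause

def pause_visemes(pause_length, n_frames, prev_last_viseme, first_or_last):
    """Choose the visemes shown during a pause of the speaking case.

    The first and last pauses, and any long pause, show a closed mouth;
    a short pause keeps the mouth unchanged, as if the pause did not exist.
    """
    if first_or_last or pause_length >= LONG_PAUSE_SECONDS:
        return [CLOSED_MOUTH] * n_frames
    return [prev_last_viseme] * n_frames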

5.3.2 Singing Case

Due to certain properties of songs, a singing person often has to utter syllables for longer durations, especially when singing a slow song. The frame interpolation technique proposed in Section 5.3.1 is not suitable for determining the visemes of a lengthy syllable because it would make the mouth change its shape in slow motion, which is not natural.

To solve this problem, several facts are noticed. The first is that the mouth always stays open while singing, even during long pauses. The second is that after the sound of a syllable is uttered, the mouth holds its shape unchanged and continues uttering the sound. Before the mouth holds its shape, we say it is in a “mouth-opening” phase; when the mouth begins to hold its shape, we say it is in a “mouth-holding” phase. Fig. 5.6 shows a diagram of these two phases.

The third fact is that the duration of the mouth-opening phase is related to the total duration of a syllable: the longer the syllable, the longer the mouth-opening phase. The experimental results in Fig. 5.7, which show the duration of the mouth-opening phase for different numbers of beats in a measure, support this fact.

Fig. 5.6 The two phases while singing a long syllable.


Fig. 5.7 The duration of the mouth-opening phase of syllables of a same sentence using different beats. (a) The sentence is “紅紅的花開滿了木棉道”. (b) The sentence is “你和我不常聯絡也沒有彼此要求”.

Therefore, the frame generation process of the speaking case needs to be modified to fit these observed facts. The first modification is that the mouth does not close during pauses; that is, the visemes of a pause are the same as the last viseme of the preceding syllable. The orange parts in Fig. 5.8 show this modification.

The second modification is that the mouth should go through a mouth-opening phase and a mouth-holding phase while singing a long syllable. Suppose the duration of a syllable S in the database is Dd, and its duration in an input singing audio is Da. When Da is larger than Dd, the mouth should utter the sound for a duration of Dopening, and then keep its shape unchanged for a duration of Da - Dopening. Dopening is defined as follows:

Dopening = Dd + Da / Dd
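A hedged sketch of the two-phase arrangement of a long sung syllable is given below, reusing resample_frames from the sketch in Section 5.3.1. Durations are taken as frame counts, the formula for Dopening is used exactly as stated above, and the function name and the clamping of Dopening to the desired length are illustrative assumptions rather than the thesis's exact procedure.

def singing_syllable_frames(db_frames, n_desired):
    """Arrange the viseme frames of one sung syllable into two phases.

    db_frames : viseme frames of the syllable stored in the database (Dd frames)
    n_desired : number of frames the syllable must occupy in the output (Da frames)
    """
    n_db = len(db_frames)                              # Dd
    if n_desired <= n_db:
        # short syllable: handle it like the speaking case
        return resample_frames(db_frames, n_desired)
    # mouth-opening phase: Dopening = Dd + Da / Dd (formula from the text)
    n_opening = min(int(n_db + n_desired / n_db), n_desired)
    opening = resample_frames(db_frames, n_opening)
    # mouth-holding phase: hold the last viseme for the remaining Da - Dopening
    holding = [db_frames[-1]] * (n_desired - n_opening)
    return opening + holding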

Fig. 5.8 shows the entire frame generation process of singing cases.

Fig. 5.8 Frame generation of singing cases.

5.4 Smoothing Between Visemes

To create smooth animations, the transitions between two successive syllables need to be taken care of. When the mouth shape of the last viseme of the preceding syllable is not similar to that of the first viseme of the rear syllable, some transition visemes should be inserted to smooth the articulation.

Table 5.1 shows an experimental result concerning the relationship between the distance of two visemes and the number of required transition frames. It is observed that there is no proportional relationship between the distance and the number of transition frames; however, the average number of transition frames is 2. Besides, we notice that when a person speaks more slowly, more transition frames can be inserted because the mouth changes its shape slowly, and when a person speaks faster, fewer transition frames need to be inserted because the mouth changes its shape rapidly.

Therefore, an algorithm for deciding the number of transition frames between two successive syllables is defined as follows:

Algorithm 5. Deciding the number of transition frames between two syllables.

Input: a preceding syllable S1 and a rear syllable S2, a viseme database D, and an input audio A.

Output: the number of required transition frames N.

Steps:

Table 5.1 Relationship between the distance of two visemes and the number of required transition frames.

5.5 Integration of Mouth Images and Base Images

After all visemes are determined according to the methods proposed in the previous sections, the mouth image of a viseme needs to be integrated into a base image using the alpha-blending technique, as shown in Fig. 5.9. Here a mouth image represents a region on a facial image that contains the lips and the jaw. However, the determination of the region is not easy. In [6], Lin and Tsai used a fixed rectangle surrounding the mouth to represent the region. This approach is simple to implement; however, determining the size of the rectangle is not an easy job because it severely affects the integrated results. A rectangle that is too wide and overlaps the background, such as the walls, may cause the background to be “integrated” into the face. A rectangle that is too short may cause the jaw to appear to “drop down” when the mouth opens, because only part of the jaw is moving.

Fig. 5.9 Integration of a base image and a mouth image. (a) A base image. (b) A mouth image. (c) The integrated image.
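A minimal numpy sketch of the alpha-blending step is shown below. It assumes the base image and the mouth image are aligned arrays of the same size and that the mouth region is given as a boolean mask; the fixed blending weight is an illustrative choice.

import numpy as np

def blend_mouth(base, mouth, mouth_mask, alpha=0.8):
    """Alpha-blend a mouth image into a base image inside a mouth region.

    base, mouth : H x W x 3 uint8 images of the same size
    mouth_mask  : H x W boolean array, True inside the mouth region
    alpha       : weight of the mouth image inside the region
    """
    out = base.astype(np.float32)
    m = mouth.astype(np.float32)
    out[mouth_mask] = alpha * m[mouth_mask] + (1.0 - alpha) * out[mouth_mask]
    return out.astype(np.uint8)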

A method is proposed to determine the region of a mouth image. Suppose that there are two images I1 and I2, and the mouth region of I1 is to be integrated into I2. Then the method goes as follows.

Firstly, since the position and size of the base region are already known from the methods proposed in Chapter 4, the skin color can be determined by averaging the colors in the base region. Secondly, the skin regions of I1 and I2 are determined as S1 and S2, respectively. Finally, the intersection region Sintersect of S1 and S2 is found, and a region growing method is utilized to discard noise. Sintersect is used as the mouth region.
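The mouth-region determination can be sketched as follows under the stated assumptions: the skin color is estimated by averaging the colors inside the known base region, the skin regions of the two images are found by simple color-distance thresholding, and the largest connected component of their intersection stands in for the region-growing noise removal mentioned above. The threshold value and the use of scipy.ndimage are illustrative choices, not the thesis's exact procedure.

import numpy as np
from scipy import ndimage

def skin_mask(image, skin_color, threshold=40.0):
    """Pixels whose RGB distance to the estimated skin color is small."""
    diff = image.astype(np.float32) - np.asarray(skin_color, dtype=np.float32)
    return np.linalg.norm(diff, axis=2) < threshold

def mouth_region(img1, img2, base_box, threshold=40.0):
    """Estimate the mouth region Sintersect shared by two face images.

    base_box : (top, bottom, left, right) of the base region found with the
               methods of Chapter 4; its mean color estimates the skin color.
    Returns a boolean mask: the largest connected component of the
    intersection of the two skin regions.
    """
    top, bottom, left, right = base_box
    skin_color = img1[top:bottom, left:right].reshape(-1, 3).mean(axis=0)
    intersect = (skin_mask(img1, skin_color, threshold) &
                 skin_mask(img2, skin_color, threshold))
    labels, n_components = ndimage.label(intersect)
    if n_components == 0:
        return intersect
    sizes = ndimage.sum(intersect, labels, index=range(1, n_components + 1))
    return labels == (int(np.argmax(sizes)) + 1)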

Besides using Sintersect as the mouth region, a trapezoid inside it can also be used as another choice of the mouth region. Fig. 5.10 shows examples of these two kinds of mouth regions.

Fig. 5.10 Example of mouth regions found using the proposed method. (a) The intersection region of skin parts. (b) A trapezoid inside the intersection region.

5.6 Experimental Results

In this section, some experimental results of the proposed methods are shown. In Fig. 5.11, the model is speaking a sentence, and in Fig. 5.12, the model is singing a song.

Fig. 5.11 Result of frame generation of a speaking case. The person is speaking the sentence “夕陽”.

Fig. 5.11 Result of frame generation of a speaking case. The person is speaking the sentence “夕陽”. (Continued)

Fig. 5.11 Result of frame generation of a speaking case. The person is speaking the sentence “夕陽”. (Continued)

Fig. 5.12 Result of frame generation of a singing case. The person is singing the sentence “如果雲知道”. The frames shown are part of “知道”.

Fig. 5.12 Result of frame generation of a singing case. The person is singing the sentence “如果雲知道”. The frames shown are part of “知道”. (Continued)

5.7 Summary and Discussions

In this chapter, the concentration has been put on improving the quality of the generated animations. The first issue was the synchronization between a speech and the image frames, because people notice asynchrony easily. The proposed method reduced the synchronization error to less than the duration of one frame. To make the animations natural, the behaviors of real talking and singing persons were discussed so that the generated virtual faces can act in the same way as real human beings. The articulation effect at transitions was also noticed and handled. Finally, the mouth regions found by the proposed method were suitable for integration into the base images.

Chapter 6

Examples of Applications

6.1 Introduction

Virtual talking faces can be applied to many areas. For example, they can be used as agents or assistants to help people do their jobs. They can also be used as tutors to help students study.

In this chapter, some examples of applications are described. Section 6.2 shows how virtual announcers that are able to report news are created. Section 6.3 presents virtual singers that can sing songs. In Section 6.4, virtual talking faces are integrated into emails, so that people can watch their friends reading the contents of received emails. In Section 6.5, some other possible applications are listed.

6.2 Virtual Announcers

6.2.1 Description

Virtual announcers are virtual talking faces that can report news. As shown in Fig. 6.1, real news reporters appear on television screens with their faces and parts of their bodies visible. Since real news reporters may sometimes get sick or be occupied by other tasks, virtual announcers, whose news releases can be recorded in advance, can take their place. Moreover, virtual announcers may even replace real news reporters entirely when the techniques are exquisitely applied and the virtual faces exhibit very realistic appearances.

Fig. 6.1 Examples of real news reporters.

6.2.2 Process of Creation

The process of creating a virtual announcer is shown in Fig. 6.2. Firstly, a video of a speaking person is recorded. Secondly, the feature learning process extracts all required features from the video. These two jobs follow the processes proposed in Chapters 2 through 4. Then, any news audio can be fed into the animation generation process to create animations of the virtual announcer. It is worth noting that the frame generation method for the speaking case proposed in Section 5.3.1 is utilized at this stage.

An example of a virtual announcer is shown in Fig. 6.3. The background may be changed dynamically to fit the news that is being reported.

6.3 Virtual Singers

6.3.1 Description

Similar to virtual announcers, virtual singers make use of virtual faces that are able to sing songs. They can be used for entertainment.

Fig. 6.2 Block diagram of creation of virtual announcers.

Fig. 6.3 Example of a virtual announcer.

6.3.2 Process of Creation

The process of creating a virtual singer is shown in Fig. 6.4. The process is mostly the same as the one for creating a virtual announcer; the only difference is that the frame generation method for the singing case proposed in Section 5.3.2 is utilized instead.
