Chapter 5 Virtual Talking Face Animation Generation

5.3 Frame Generation by Interpolation

The following two subsections propose frame generation processes for two applications. Section 5.3.1 presents a process suited to creating virtual talking faces, and Section 5.3.2 describes one suited to creating virtual singing faces.

Fig. 5.3 Diagram of re-synchronization on every syllable.

5.3.1 Speaking Case

In Section 5.2, the number of frames for every syllable in the input audio is determined. This number cannot be altered, or the audio and the image frames will fall out of synchronization. However, the number of frames recorded for the same syllable in the viseme database may differ from the desired number because of differences in speaking speed, so the database frames must be adjusted to fit.

In [6], Lin and Tsai proposed an algorithm for increasing and decreasing frames to do this job. The algorithm inserts frames at positions where adjacent frames are most unlike, and deletes frames at positions where adjacent frames are most alike, until the number of frames equals the desired one. However, the results produced by this algorithm are not reasonable. When a person speaks faster, the shape of his/her mouth changes more abruptly; conversely, when a person speaks slower, the shape of his/her mouth changes more gradually. In both cases, however, the motion of the mouth remains the same when speaking an identical syllable.

To simulate the motion mentioned above, the idea of interpolation is used. As shown in Fig. 5.4, the original frames are divided into N parts, where N is the number of desired frames. Then, one frame of each part is selected to represent that part. Finally, the content of the desired frames is replaced with the content of the representative frames one by one.
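The interpolation idea can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in this study: the function name `resample_frames` is hypothetical, and the choice of the frame at the midpoint of each part as its representative is an assumption, since the text does not state which frame of a part is selected.

```python
def resample_frames(frames, n_desired):
    """Pick n_desired representative frames from the original frames.

    The original frames are divided into n_desired equal parts, and one
    frame is taken from each part. Assumption: the representative is the
    frame lying under the midpoint of the part.
    """
    m = len(frames)
    if m == 0 or n_desired <= 0:
        return []
    result = []
    for i in range(n_desired):
        # Part i covers the index range [i*m/n, (i+1)*m/n); its midpoint is
        # (2i+1)*m/(2n). Integer arithmetic avoids floating-point rounding.
        index = ((2 * i + 1) * m) // (2 * n_desired)
        result.append(frames[index])
    return result


# Fewer desired frames than original frames, as in Fig. 5.4(a).
print(resample_frames(list("abcdef"), 3))
# More desired frames than original frames, as in Fig. 5.4(b).
print(resample_frames(list("abc"), 6))
```

When more frames are desired than are available, the same original frame represents several neighboring parts and is therefore repeated; when fewer are desired, some original frames are skipped. In both cases the order of the mouth shapes is preserved.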

Fig. 5.4 Idea of frame interpolation. (a) The number of original frames is larger than that of desired frames. (b) The number of original frames is smaller than that of desired frames.

The process of frame generation for the speaking case is illustrated in Fig. 5.5. Firstly, the visemes, namely, the mouth images, of syllables are determined using the frame interpolation technique. Secondly, the visemes of pauses between syllables need to be decided. When the duration of a pause is long, it is considered that the person would close his/her mouth; otherwise, the person would keep his/her mouth open and unchanged, as if the pause did not exist. Thirdly, the visemes of the first and the last pauses should be a closed mouth, because the person has not started speaking during the first pause, and he/she closes his/her mouth after the last syllable. Finally, the determined visemes are integrated into the base images.
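The first three steps of this process can be sketched as follows. This is a hedged illustration under several assumptions that do not come from the text: visemes are represented by string labels, `CLOSED` is a hypothetical label for the closed-mouth viseme, the threshold `long_pause_frames` that separates long pauses from short ones is an arbitrary placeholder, and representatives are taken at the midpoint of each part. The final step, integrating the visemes into the base images, is omitted.

```python
CLOSED = "closed"  # hypothetical label for the closed-mouth viseme


def resample_frames(frames, n_desired):
    """Frame interpolation: one representative frame per part (midpoint)."""
    m = len(frames)
    if m == 0 or n_desired <= 0:
        return []
    return [frames[((2 * i + 1) * m) // (2 * n_desired)]
            for i in range(n_desired)]


def speaking_visemes(segments, database, long_pause_frames=5):
    """Determine one viseme per output frame for the speaking case.

    segments: list of (label, n_frames); label is a syllable, or None
              for a pause. The frame counts come from the audio analysis.
    database: dict mapping each syllable to its recorded viseme frames.
    long_pause_frames: assumed threshold; the text gives no value.
    """
    out = []
    last = len(segments) - 1
    for k, (label, n) in enumerate(segments):
        if label is not None:
            # Step 1: syllable visemes by frame interpolation.
            out.extend(resample_frames(database[label], n))
        elif k == 0 or k == last or n >= long_pause_frames or not out:
            # Step 3 (first or last pause) and long pauses: closed mouth.
            out.extend([CLOSED] * n)
        else:
            # Step 2, short pause: keep the mouth open and unchanged.
            out.extend([out[-1]] * n)
    return out


database = {"ma": ["m1", "m2", "m3", "m4"], "ba": ["b1", "b2"]}
segments = [(None, 2), ("ma", 2), (None, 1), ("ba", 4), (None, 3)]
print(speaking_visemes(segments, database))
```

In this example the one-frame pause between the two syllables simply repeats the last viseme of "ma", while the first and last pauses are filled with the closed mouth regardless of their length.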

Fig. 5.5 Frame generation of speaking cases.

5.3.2 Singing Case

Due to certain properties of songs, a singing person often has to utter syllables for a longer time, especially when he/she is singing a slow song. The frame interpolation technique proposed in Section 5.3.1 is not suitable for determining the visemes of a lengthy syllable, because it would make the mouth change its shape in slow motion, which is not natural.

To solve this problem, several facts are noticed. The first is that the mouth always stays open while singing, even during long pauses. The second is that after the sound of a syllable is uttered, the mouth holds its shape unchanged and continues uttering the sound. Before the mouth holds its shape, we say that it is in a “mouth-opening” phase. Once the mouth begins to hold its shape, we say that it is in a “mouth-holding” phase. Fig. 5.6 shows a diagram of these two phases.

The third fact is that the duration of the mouth-opening phase is related to the total duration of a syllable: the longer the syllable, the longer the mouth-opening phase. The experimental results in Fig. 5.7 support this fact. The duration of the mouth-opening phase was observed for the same sentences sung with different numbers of beats in a measure, and it becomes longer as the duration of a syllable becomes longer.

Fig. 5.6 The two phases while singing a long syllable.

Fig. 5.7 The duration of the mouth-opening phase of syllables of the same sentence using different beats. (a) The sentence is “紅紅的花開滿了木棉道”. (b) The sentence is “你和我不常聯絡也沒有彼此要求”.

Therefore, the process of frame generation for the speaking case needs to be modified to fit these observed facts. The first modification is that the mouth does not close during pauses. That is, the visemes of a pause are the same as the last viseme of the preceding syllable. The orange parts in Fig. 5.8 show this modification.

The second modification is that the mouth should go through a mouth-opening phase and a mouth-holding phase while singing a long syllable. Suppose the duration of a syllable S in the database is D_d, and that in an input audio of singing is D_a. When D_a is larger than D_d, the mouth should utter the sound in a duration of D_opening, and then keep its shape unchanged for a duration of D_a - D_opening. D_opening is defined as follows:

D_opening = D_d + D_a / D_d
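The two phases can be sketched for a single syllable as follows. This is a rough illustration only, and it rests on assumptions that the text does not state: durations are counted in frames, D_opening is rounded to a whole number of frames and capped at D_a, the visemes of the mouth-opening phase are obtained by frame interpolation with midpoint representatives, and a syllable that is not longer than its database version falls back to plain interpolation. The function names are hypothetical.

```python
def resample_frames(frames, n_desired):
    """Frame interpolation: one representative frame per part (midpoint)."""
    m = len(frames)
    if m == 0 or n_desired <= 0:
        return []
    return [frames[((2 * i + 1) * m) // (2 * n_desired)]
            for i in range(n_desired)]


def opening_duration(d_d, d_a):
    """D_opening = D_d + D_a / D_d, with durations assumed to be in frames."""
    return d_d + d_a / d_d


def singing_visemes(db_frames, n_audio):
    """Visemes for one sung syllable that lasts n_audio frames."""
    d_d = len(db_frames)
    if d_d == 0 or n_audio <= 0:
        return []
    if n_audio <= d_d:
        # Assumption: no holding phase is needed for a short syllable.
        return resample_frames(db_frames, n_audio)
    # Assumption: round to whole frames and never exceed the syllable.
    n_open = min(n_audio, round(opening_duration(d_d, n_audio)))
    # Mouth-opening phase: the database frames stretched to n_open frames.
    opening = resample_frames(db_frames, n_open)
    # Mouth-holding phase: the final mouth shape is kept unchanged.
    holding = [opening[-1]] * (n_audio - n_open)
    return opening + holding


# A 4-frame database syllable sung over 12 frames: D_opening = 4 + 12/4 = 7.
print(singing_visemes(["v1", "v2", "v3", "v4"], 12))
```

With these numbers the mouth shape changes over the first 7 frames and is then held for the remaining 5, instead of being stretched in slow motion over all 12.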

Fig. 5.8 shows the entire frame generation process of singing cases.

Fig. 5.8 Frame generation of singing cases.

A Study on Automatic Construction of Virtual Talking Faces in Documents and Its Related Applications (pp. 64-69)