

Chapter 5 Virtual Talking Face Animation Generation

5.5 Integration of Mouth Images and Base Images

After all visemes are determined according to the methods proposed in the previous sections, the mouth image of a viseme needs to be integrated into a base image using the alpha-blending technique, as shown in Fig. 5.9. Here a mouth image is a region of a facial image that contains the lips and the jaw. However, determining this region is not easy. In [6], Lin and Tsai used a fixed rectangle surrounding the mouth to represent the region. The approach is simple to implement; however, choosing the size of the rectangle is not easy because it severely affects the integration results. A rectangle that is too wide overlaps the background, such as a wall, and may cause the background to be "integrated" into the face. A rectangle that is too short may cause the jaw to "drop down" unnaturally while the mouth opens, because only the part of the jaw inside the rectangle moves.


Fig. 5.9 Integration of a base image and a mouth image. (a) A base image. (b) A mouth image. (c) The integrated image.
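The blending step itself is straightforward. Below is a minimal sketch of alpha-blending a mouth image into a base image, assuming both images have the same size and the mouth region is supplied as a mask; the array layout and the soft-mask idea are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def blend_mouth_into_base(base_img, mouth_img, mask):
    """Alpha-blend a mouth image into a base image.

    base_img, mouth_img: H x W x 3 float arrays in [0, 1], same size.
    mask: H x W float array in [0, 1]; 1 inside the mouth region,
          0 outside, with soft values near the boundary to hide the seam.
    """
    alpha = mask[..., np.newaxis]                 # broadcast over the color channels
    return alpha * mouth_img + (1.0 - alpha) * base_img
```

Feathering the mask near the boundary of the mouth region avoids a visible seam between the pasted region and the base face.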

A method is proposed to determine the region of a mouth image. Suppose that there are two images I1 and I2, and the mouth region of I1 is to be integrated into I2. Then the method goes as follows.

Firstly, since the position and size of the base region are already known from the methods proposed in Chapter 4, the skin color is determined by averaging the colors in the base region. Secondly, the skin regions S1 and S2 of I1 and I2, respectively, are determined. Finally, the intersection region Sintersect of S1 and S2 is computed, and a region growing method is utilized to discard noise. The resulting Sintersect is used as the mouth region.
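A possible realization of these three steps is sketched below. The color-distance threshold and the use of a largest-connected-component filter in place of the region growing step are illustrative assumptions; the thesis does not specify these details.

```python
import numpy as np
from scipy import ndimage  # connected-component labeling for the noise-removal step

def skin_mask(img, skin_color, threshold=30.0):
    """Mark pixels whose color is close to the average skin color.
    The threshold (in 0-255 color units) is an assumed value."""
    dist = np.linalg.norm(img.astype(float) - skin_color, axis=-1)
    return dist < threshold

def mouth_region(img1, img2, base_region):
    """Estimate the mouth region of img1 to be integrated into img2.

    base_region: (top, bottom, left, right) of the base region,
                 known from the feature learning stage (Chapter 4).
    """
    t, b, l, r = base_region
    skin_color = img1[t:b, l:r].reshape(-1, 3).mean(axis=0)   # step 1: average skin color
    s1 = skin_mask(img1, skin_color)                          # step 2: skin regions S1, S2
    s2 = skin_mask(img2, skin_color)
    intersect = s1 & s2                                       # step 3: intersection Sintersect
    labels, n = ndimage.label(intersect)                      # discard small noisy blobs by
    if n == 0:                                                # keeping the largest component
        return intersect
    sizes = ndimage.sum(intersect, labels, range(1, n + 1))
    return labels == (np.argmax(sizes) + 1)
```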

Besides Sintersect itself, a trapezoid inscribed in Sintersect can also be used as the mouth region. Fig. 5.10 shows examples of these two kinds of mouth regions.

Fig. 5.10 Examples of mouth regions found using the proposed method. (a) The intersection region of skin parts. (b) A trapezoid inside the intersection region.

5.6 Experimental Results

In this section, some experimental results of the proposed methods are shown. In Fig. 5.11, the model is speaking a sentence, and in Fig. 5.12, the model is singing a song.

Fig. 5.11 Result of frame generation of a speaking case. The person is speaking the sentence “夕陽”.

Fig. 5.11 Result of frame generation of a speaking case. The person is speaking the sentence “夕陽”. (Continued)

Fig. 5.11 Result of frame generation of a speaking case. The person is speaking the sentence “夕陽”. (Continued)

Fig. 5.12 Result of frame generation of a singing case. The person is singing the sentence “如果雲知道”. The frames shown are part of “知道”.

Fig. 5.12 Result of frame generation of a singing case. The person is singing the sentence “如果雲知道”. The frames shown are part of “知道”. (Continued)

5.7 Summary and Discussions

In this chapter, the focus has been on improving the quality of the generated animations. The first issue was synchronization between the speech and the image frames, because viewers notice asynchrony easily. The proposed method reduces the synchronization error to less than the duration of a single frame. To make the animations natural, the behaviors of real talking and singing persons were analyzed so that the generated virtual faces act in the same way as real human beings. The articulation effect in transitions was also identified and handled. Finally, the mouth regions found by the proposed method are suitable for integration into base images.
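One way to see why the error can stay below one frame period is to assign frame counts by rounding cumulative times rather than individual durations, so the error never accumulates. The sketch below illustrates this rounding argument only; it is not the exact scheme of this chapter, and the frame rate and durations are made-up values.

```python
FPS = 30.0  # frame rate used for illustration only

def frames_per_syllable(durations, fps=FPS):
    """Assign a frame count to each syllable so that every syllable boundary
    lands within half a frame period of its true time.

    Rounding the cumulative time (not each duration separately) prevents
    the timing error from accumulating across syllables.
    """
    counts, done_frames, elapsed = [], 0, 0.0
    for d in durations:
        elapsed += d
        end_frame = round(elapsed * fps)      # nearest frame boundary
        counts.append(end_frame - done_frames)
        done_frames = end_frame
    return counts

print(frames_per_syllable([0.21, 0.18, 0.33, 0.27]))  # -> [6, 6, 10, 8]
```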

Chapter 6

Examples of Applications

6.1 Introduction

Virtual talking faces can be applied to many areas. For example, they can be used as agents or assistants to help people do their jobs. They can also be used as tutors to help students study.

In this chapter, some examples of applications are described. Section 6.2 shows how virtual announcers that are able to report news are created. Section 6.3 presents virtual singers that can sing songs. In Section 6.4, virtual talking faces are integrated into emails, so that people can watch their friends reading the contents of received emails. In Section 6.5, some other possible applications are listed.

6.2 Virtual Announcers

6.2.1 Description

Virtual announcers are virtual talking faces that can report news. As shown in Fig. 6.1, real news reporters appear on television screens with their faces and part of their bodies visible. Since real news reporters may sometimes get sick or be occupied by other tasks, virtual announcers, created from recordings made in advance, can take their place. Moreover, when the techniques are applied carefully and the generated faces look sufficiently realistic, virtual announcers may even replace real news reporters entirely.

Fig. 6.1 Examples of real news reporters.

6.2.2 Process of Creation

The process of creating a virtual announcer is shown in Fig. 6.2. Firstly, a video of a speaking person needs to be recorded. Secondly, the feature learning process extracts all required features from the video. These two jobs follow the processes proposed in Chapters 2 through 4. Then, any news audio can be fed into the animation generation process to create the announcer animation. Note that the frame generation method for the speaking case, proposed in Section 5.3.1, is utilized at this stage.

An example of a virtual announcer is shown in Fig. 6.3. The background may be changed dynamically to fit the news that is being reported.

6.3 Virtual Singers

6.3.1 Description

Similar to virtual announcers, virtual singers are virtual talking faces that are able to sing songs. They can be used for entertainment.

Fig. 6.2 Block diagram of creation of virtual announcers.

Fig. 6.3 Example of a virtual announcer.

6.3.2 Process of Creation

The process of creating a virtual singer is shown in Fig. 6.4. The process is mostly the same as that of creating a virtual announcer. The only difference is that the frame generation method for the singing case, proposed in Section 5.3.2, is utilized instead of that for the speaking case.

Fig. 6.4 Block diagram of creation of virtual singers.

6.4 Emails with Virtual Talking Faces

6.4.1 Description

The use of email is a very common way to transmit messages nowadays. People send and receive emails almost every day. When a person receives an email, he/she needs to read its content to know what the sender wants to express. A virtual talking face can be integrated into an email so that the receiver can understand the content by watching and listening to an animation of the sender’s face instead of reading the text.

Fig. 6.5 shows comparisons between normal emails and emails with embedded virtual talking faces. Fig. 6.5(a) shows a text email being sent through the Internet and received; the receiver needs to read the text message. Fig. 6.5(b) shows the process of sending and receiving an email with a virtual talking face: the email text and the corresponding speech read by the sender are sent through the Internet, and the receiver can generate a virtual talking face that reads the text aloud and watch it.

Fig. 6.5(c) shows a way to produce a result similar to that of (b). The sender records a video while he/she is reading the text, and the video is then sent through the Internet along with the text. The receiver can watch the video directly. However, this method has some disadvantages. First, the sender must be filmed while reading, which means a camera is required. Second, the video is usually much larger than the speech alone; sending large videos through the Internet is slow, and receiving large emails is annoying.


Fig. 6.5 Comparisons among three kinds of emails.

6.4.2 Process of Sending Emails

As shown in Fig. 6.5(b), the speech needs to be sent along with the email content. The speech can be wrapped as an attachment file of the email. Some extra information can be added to the attachment file, such as the name of the sender. Fig. 6.6 shows the structure of the attachment file.

Fig. 6.7 shows the entire process of sending an email with a virtual talking face. Firstly, the sender writes a text and then reads it aloud; the speech is recorded as an audio file. Then the sender’s name, the size of the text, and the size of the speech are combined with the text and the speech to form an attachment file. Finally, the attachment file is sent through the Internet.

Fig. 6.6 Structure of an attachment file.

Fig. 6.7 Process of sending emails with a virtual talking face.
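A minimal sketch of how such an attachment could be laid out is given below, following the field order suggested by Fig. 6.6 (sender name, text size, speech size, then the text and speech data). The fixed-width name field and the 4-byte little-endian size fields are assumptions made for illustration; the thesis does not specify the exact byte layout.

```python
import struct

NAME_FIELD = 32  # fixed-width sender-name field; the width is an assumed value

def pack_attachment(sender_name, text, speech_bytes):
    """Build the attachment payload: sender name, text size, speech size,
    followed by the text and the recorded speech."""
    name = sender_name.encode("utf-8")[:NAME_FIELD].ljust(NAME_FIELD, b"\0")
    text_bytes = text.encode("utf-8")
    header = name + struct.pack("<II", len(text_bytes), len(speech_bytes))
    return header + text_bytes + speech_bytes
```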

6.4.3 Process of Receiving Emails

After receiving such an email, the receiver can construct an animation of the sender’s face. The process is illustrated in Fig. 6.8. Firstly, the text size and the speech size are extracted and used to segment the text and the speech out of the attachment. The sender’s name is used to select the face to generate. Then, the text and the speech are sent to the animation generation process as inputs to produce an animation.

Fig. 6.8 Process of receiving an email with a virtual talking face.
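The matching receiver-side step, under the same byte-layout assumptions as the sending sketch in Section 6.4.2, could look as follows.

```python
import struct

NAME_FIELD = 32  # must match the assumed sender-side layout

def unpack_attachment(payload):
    """Split the attachment back into the sender name, the text, and the speech."""
    name = payload[:NAME_FIELD].rstrip(b"\0").decode("utf-8")
    text_size, speech_size = struct.unpack_from("<II", payload, NAME_FIELD)
    offset = NAME_FIELD + 8                      # header: name field + two 4-byte sizes
    text = payload[offset:offset + text_size].decode("utf-8")
    speech = payload[offset + text_size:offset + text_size + speech_size]
    return name, text, speech
```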

6.5 Other Applications

Besides the applications mentioned in the previous sections of this chapter, there are still many possible applications of virtual talking faces. Some of them are listed as follows.

1. Virtual tutors: Virtual talking faces can be used to help students study. It is more interesting and useful for students to have a tutor teach them than to study by themselves.

2. Virtual guides: Virtual talking faces can be used in libraries and museums to guide visitors. Visitors can roam around the buildings and listen to the explanations given by a virtual guide.

3. Virtual babysitters: Virtual talking faces can read stories to children, acting as babysitters.

4. Software agents: Virtual talking faces can be embedded in software to help users operate it.

5. Videoconferences: Transmitting the data required to drive a virtual talking face over a network takes far less bandwidth than transmitting the image frames themselves.

6.6 Experimental Results

In this section, some experimental results of creating emails with virtual talking faces are shown. The experimental results for virtual announcers and virtual singers have already been shown in Section 5.6.

Fig. 6.9 shows the program interface of animation generation. Firstly, the email sender inputs a text and a speech, and then presses the button marked in red to create an attachment file, as shown in Fig. 6.10. The red circle indicates the attachment file.

As shown in Fig. 6.11, after receiving the email, the receiver also obtains the attachment. Note that the size of the attachment file is only 120 KB, which is very small.

The receiver then opens the attachment file, and the file association causes the program to start the proposed animation generation process, as shown in Figs. 6.12 and 6.13. The final generated animation is shown in Fig. 6.14(a). Fig. 6.14(b) shows that the size of the animation is 3.67 MB, which is much larger than that of the attachment file. Therefore, transmitting virtual talking faces using the method proposed in Section 6.4 is clearly preferable to transmitting the videos directly.

Fig. 6.9 The program interface of animation generation.

Fig. 6.10 The email attachment is created.

Fig. 6.11 The attachment is received along with the email text.

Fig. 6.12 Setting the file association between the attachment file and the program.

Fig. 6.13 The animation generation process generating the animation of a person.

Fig. 6.14 The generated animation. (a) A frame of the animation. (b) The size of the generated animation.

6.7 Summary and Discussions

In this chapter, three example applications of virtual talking faces have been described and implemented. Virtual announcers can be used to report news, while virtual singers can sing songs. A system for transmitting virtual talking faces by email has also been proposed. It is interesting and useful to watch one’s friends reading their mails aloud, and the amount of data transmitted is much smaller than that of transmitting comparable videos.

Chapter 7

Experimental Results and Discussions

7.1 Experimental Results

In this study, a system for automatic feature learning and animation creation is constructed. Some screenshots of the system are shown in this section.

Firstly, a video that contains a speaking face was recorded, and then audio data and image frames were extracted from it. In Fig. 7.1, the extracted audio data were segmented into seventeen sound parts using the sentence segmentation algorithm proposed in Chapter 3. Blue and green parts represent odd and even sound parts, respectively.

Fig. 7.1 The audio data are segmented into seventeen sound parts.

Next, the timing information of syllables of these sound parts was learned using the syllable alignment technique, as shown in Fig. 7.2. Blue and green parts represent odd and even syllables, respectively. Thirdly, the base region of the first frame of the video was determined. Fig. 7.3 shows the base region with a blue rectangle. Fig. 7.4 shows the final step of the feature learning process. The position and the rotated angle of the base region for every frame were learned using the methods proposed in Chapter 4. The black cross on the right face in Fig. 7.4 indicates the center of the base region.

After the feature learning process was completed, the animation generation process was started. Firstly, the syllable alignment technique was applied to an input speech to get the timing information of syllables in the speech. In Fig. 7.5, a speech containing a Mandarin sentence “熟能生巧” was used as an input. The timing information of syllables in it was learned and displayed in blue and green colors, which represent odd and even syllables, respectively.

Fig. 7.2 The result of syllable alignment of the sentence “好朋友自遠方來”.

After the timing information of the input speech is known, virtual talking face animations can be generated. Using the methods proposed in Chapter 5, we could generate proper frames. Fig. 7.6 shows the intermediate result of the process. The right face shows the region that was to be “pasted” onto the base image, and the left face shows the result of integration. Fig. 7.7 shows the final created animation in frames. The face in the frames is speaking the Chinese sentence “熟能生巧”.

Fig. 7.3 The base region determined is displayed with a blue rectangle.

Fig. 7.4 Learning of the position of the base region for every frame.

Fig. 7.5 The result of syllable alignment of the input speech “熟能生巧”.

Fig. 7.6 The middle result of the frame generation process.

Fig. 7.7 The result of the animation generation process.

Fig. 7.7 The result of the animation generation process. (Continued)

Fig. 7.7 The result of the animation generation process. (Continued)

7.2 Discussions

After presenting the experimental results, we discuss several issues of concern as follows.

The first issue is the video recording process. The transcript, which contains all classes of Mandarin syllables, is designed as seventeen short and meaningful sentences, so the model can read them easily. The model may also shake his/her head slightly during the process. The recording takes about two minutes to complete, which is quite short and not burdensome.

The second issue is the feature learning process. To learn the audio features, the audio data of the recorded video are first segmented into sentences. The proposed sentence segmentation algorithm is effective both in quiet environments and in environments with constant noise, such as that caused by fans. The result of syllable alignment may not be completely correct; however, it is still acceptable.
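A constant noise floor can be tolerated by an energy threshold that adapts to the quietest frames. The sketch below is only an illustrative silence-based segmenter built on that idea; it is not the algorithm of Chapter 3, and the frame length, percentile, and threshold multiplier are assumed values.

```python
import numpy as np

def segment_sound_parts(samples, rate, frame_ms=20, min_silence_ms=300):
    """Split an audio signal into sound parts at long low-energy stretches.

    Illustrative only: the energy threshold is a multiple of the quietest
    frames' energy, so a constant background noise (e.g. a fan) merely
    raises the noise floor without breaking the segmentation.
    """
    frame_len = int(rate * frame_ms / 1000)
    n = len(samples) // frame_len
    frames = np.asarray(samples[:n * frame_len], dtype=float).reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    threshold = 4.0 * np.percentile(energy, 10)   # adaptive noise-floor estimate
    voiced = energy > threshold

    parts, start, gap = [], None, 0
    min_gap = min_silence_ms // frame_ms
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:                    # a long silence ends a sound part
                parts.append((start * frame_len, (i - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:
        parts.append((start * frame_len, n * frame_len))
    return parts  # list of (start_sample, end_sample) pairs
```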

To learn the facial features, the position and rotated angle of the base region in every frame are detected. The learned information is correct even when the face shakes slightly. Besides, the process is automatic; therefore, we avoid locating the positions manually, which is tedious work. Normally, the process takes about half an hour to complete, which is a little lengthy but acceptable.

Finally, we discuss the animation generation process. In this process, proper frames are generated. The experimental results show that the generated talking faces are natural. The faces also shake naturally due to the arrangement of base images. It is shown that 2D images are suitable for creating realistic talking faces with slight head shaking actions.

Chapter 8

Conclusions and Suggestions for Future Works

8.1 Conclusions

In this study, a system for creating virtual talking faces has been implemented. The system is based on the use of 2D facial images. Techniques of image processing and speech recognition are utilized. Methods are proposed to automate the learning process and improve the quality of the generated animations.

The system is composed of three processes: video recording, feature learning, and animation generation. The video recording process is designed to be short, easy, and not burdensome. A transcript that contains all classes of Mandarin syllables has been proposed, and a model reads the sentences on it instead of reading the syllables one by one. The sentences were designed to be short and meaningful, so that the model can read them without difficulty. Thanks to the feature learning process, the model is allowed to shake his/her head slightly, which yields more natural animation results.

In the feature learning process, audio features, facial features, and base image sequences are all learned automatically. The sentence segmentation algorithm proposed in the learning of audio features is simple but robust in both quiet environments and environments with constant noise. The generation process of base image sequences was designed to exhibit natural head shaking actions. The learning process of facial features was designed to be able to handle shaking faces.

In the animation generation process, several methods were proposed to improve the quality of the generated animations. A method was proposed to reduce the synchronization error between the speech and the image frames to less than the duration of a single frame. The number of transition frames required between successive syllables was analyzed to smooth the transitions. The behaviors of a talking person and a singing person were also analyzed, and a frame generation method suitable for creating both talking and singing faces was proposed. Applications that utilize the proposed techniques, such as virtual announcers and virtual singers, have been implemented. Another application that integrates virtual talking faces into emails was also described and implemented.

8.2 Suggestions for Future Works

Several suggestions to improve the proposed system are listed as follows.

(1) Reduction of the number of visemes --- The viseme information of 115 classes of Mandarin syllables is required to create animations containing arbitrary Mandarin syllables. However, the number of required visemes can still be reduced, because the mouth shapes of some syllables are quite similar. Reducing the number of required visemes would dramatically shorten the learning time.

(2) Automatic detection of base regions on faces with glasses --- In Chapter 4, a knowledge-based face recognition method is utilized to locate the base region in the first frame. Since the method needs to locate the eye pair on the face, it does not always work well on faces wearing glasses. If this problem can be solved, the feature learning process will become fully automatic even for faces wearing glasses.

(3) Integration of emotional expressions --- In this study, emotional expressions are not used because they affect many parts of the face.

