
Chapter 6 Examples of Applications

6.5 Other Applications

Besides the applications mentioned in the previous sections of this chapter, there are many other possible applications of virtual talking faces. Some of them are listed as follows.

1. Virtual tutors: Virtual talking faces can be used to help students study. It is more engaging and useful for students to learn from a virtual tutor than to study by themselves.

2. Virtual guides: Virtual talking faces can be used in libraries and museums to guide visitors. Visitors can roam around the buildings and listen to the explanations given by a virtual guide.

3. Virtual babysitters: Virtual talking faces can read stories to children, acting as babysitters.

4. Software agents: Virtual talking faces can be embedded in software as agents that help users operate it.

5. Videoconferences: Transmitting the parameters of virtual talking faces over a network requires much less data than transmitting the corresponding image frames.

6.6 Experimental Results

In this section, some experimental results of creating emails with virtual talking faces are shown. The experimental results of the virtual announcers and singers have already been presented in Section 5.6 of Chapter 5.

Fig. 6.9 shows the program interface of animation generation. First, the email sender inputs a text and a speech recording, and then presses the button marked in red to create an attachment file, as shown in Fig. 6.10. The red circle indicates the attachment file.

As shown in Fig. 6.11, after receiving the email, the receiver also receives the attachment. Note that the size of the attachment file is only 120 KB, which is very small.

The receiver then opens the attachment file, and the file association causes the program to start the proposed animation generation process, as shown in Figs. 6.12 and 6.13. The final generated animation is shown in Fig. 6.14(a). Fig. 6.14(b) shows that the size of the animation is 3.67 MB, which is much larger than that of the attachment file. Therefore, transmitting virtual talking faces by the method proposed in Section 6.4 is clearly preferable to transmitting the videos directly.
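To make the size comparison concrete, the following is a minimal sketch of how such a compact attachment might be packaged and how the bandwidth saving works out; the container format, the file names, and the helper pack_attachment are hypothetical illustrations, not the actual format used by the proposed system.

```python
import zipfile

def pack_attachment(text_path, speech_path, out_path):
    """Pack the mail text and the speech recording into one small
    attachment file (hypothetical ZIP container, for illustration only)."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(text_path, "mail.txt")
        zf.write(speech_path, "speech.wav")

# Rough bandwidth comparison using the sizes reported above:
# the 120 KB attachment versus the 3.67 MB generated animation.
attachment_kb = 120
animation_kb = 3.67 * 1024
print(f"The video is about {animation_kb / attachment_kb:.0f}x larger.")  # ~31x
```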

Fig. 6.9 The program interface of animation generation.

Fig. 6.10 The email attachment is created.

Fig. 6.11 The attachment is received along with the email text.

Fig. 6.12 Setting of the file association between the attachment file and the program.

Fig. 6.13 The animation generation process is generating the animation of a person.

Fig. 6.14 The generated animation. (a) A frame of the animation. (b) The size of the generated animation.

6.7 Summary and Discussions

In this chapter, three example applications of virtual talking faces have been described and implemented. The virtual announcers can be used to report news, while the virtual singers can sing songs. A system for transmitting virtual talking faces by email has also been proposed. It is interesting and useful for recipients to watch their friends read the mails, and the amount of transmitted data is much smaller than that required for transmitting equivalent videos.

Chapter 7

Experimental Results and Discussions

7.1 Experimental Results

In this study, a system for automatic feature learning and animation creation is constructed. Some screenshots of the system are shown in this section.

First, a video containing a speaking face was recorded, and then audio data and image frames were extracted from it. As shown in Fig. 7.1, the extracted audio data were segmented into seventeen sound parts using the sentence segmentation algorithm proposed in Chapter 3. Blue and green parts represent odd and even sound parts, respectively.

Fig. 7.1 The audio data are segmented into seventeen sound parts.
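For illustration, the following is a minimal sketch of one common energy-based way to obtain such sound parts from the audio samples; the thresholding and smoothing details of the actual algorithm in Chapter 3 may differ.

```python
import numpy as np

def segment_sentences(samples, rate, frame_ms=20, min_gap_ms=300):
    """Split audio samples into sound parts separated by long low-energy gaps.
    A simple energy-threshold sketch, not the algorithm of Chapter 3 itself."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[:n_frames * frame_len], dtype=np.float64)
    energy = (frames.reshape(n_frames, frame_len) ** 2).mean(axis=1)

    # Estimate the (possibly constant) noise floor and threshold above it.
    noise_floor = np.percentile(energy, 10)
    voiced = energy > 4.0 * noise_floor

    min_gap = max(int(min_gap_ms / frame_ms), 1)
    parts, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # silence long enough: close the current part
                parts.append((start * frame_len, (i - gap + 1) * frame_len))
                start = None
    if start is not None:  # the audio ended while still inside a sound part
        parts.append((start * frame_len, n_frames * frame_len))
    return parts  # list of (begin_sample, end_sample) index pairs
```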

Next, the timing information of the syllables in these sound parts was learned using the syllable alignment technique, as shown in Fig. 7.2. Blue and green parts represent odd and even syllables, respectively. Third, the base region of the first frame of the video was determined; Fig. 7.3 shows the base region with a blue rectangle. Fig. 7.4 shows the final step of the feature learning process: the position and the rotation angle of the base region in every frame were learned using the methods proposed in Chapter 4. The black cross on the right face in Fig. 7.4 indicates the center of the base region.
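As an illustration of what this tracking step involves, the sketch below follows the base region across frames with normalized cross-correlation (template matching). It searches only over translation and assumes 3-channel BGR frames, whereas the method of Chapter 4 also recovers the rotation angle, so it is a simplified stand-in rather than the actual implementation.

```python
import cv2

def track_base_region(frames, base_rect):
    """Track the base region (x, y, w, h found in the first frame) through
    the remaining frames by template matching (translation only)."""
    x, y, w, h = base_rect
    template = cv2.cvtColor(frames[0][y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    positions = [(x, y)]
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        score = cv2.matchTemplate(gray, template, cv2.TM_CCOEFF_NORMED)
        _, _, _, best = cv2.minMaxLoc(score)   # best = (x, y) of the peak
        positions.append(best)
    return positions  # top-left corner of the base region in every frame
```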

After the feature learning process was completed, the animation generation process was started. First, the syllable alignment technique was applied to an input speech to obtain the timing information of the syllables in it. In Fig. 7.5, a speech containing the Mandarin sentence “熟能生巧” (“practice makes perfect”) was used as the input. The timing information of the syllables was learned and displayed in blue and green, representing odd and even syllables, respectively.

Fig. 7.2 The result of syllable alignment of the sentence “好朋友自遠方來”.

After the timing information of the input speech was known, virtual talking face animations could be generated. Using the methods proposed in Chapter 5, we could generate the proper frames. Fig. 7.6 shows an intermediate result of the process: the right face shows the region that was to be “pasted” onto the base image, and the left face shows the result of the integration. Fig. 7.7 shows the frames of the final created animation, in which the face is speaking the Mandarin sentence “熟能生巧”.
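The “pasting” step can be pictured with the following sketch, which blends a mouth-region patch into a base image with a soft border mask; the mask shape and blending weights are illustrative assumptions, not the exact procedure proposed in Chapter 5.

```python
import numpy as np

def paste_region(base_img, patch, top_left, feather=5):
    """Blend a mouth-region patch into the base image at top_left,
    feathering the border so the seam is less visible. Assumes both
    images are 3-channel arrays (illustrative only)."""
    h, w = patch.shape[:2]
    y, x = top_left

    # Soft mask: 1 inside, ramping down to 0 over `feather` pixels at the border.
    ramp_y = np.minimum(np.arange(h), np.arange(h)[::-1]) / float(feather)
    ramp_x = np.minimum(np.arange(w), np.arange(w)[::-1]) / float(feather)
    mask = np.clip(np.outer(ramp_y, ramp_x), 0.0, 1.0)[..., None]

    out = base_img.astype(np.float64).copy()
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = mask * patch + (1.0 - mask) * region
    return out.astype(base_img.dtype)
```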

Fig. 7.3 The base region determined is displayed with a blue rectangle.

Fig. 7.4 Learning of the position of the base region for every frame.

Fig. 7.5 The result of syllable alignment of the input speech “熟能生巧”.

Fig. 7.6 The intermediate result of the frame generation process.

Fig. 7.7 The result of the animation generation process.

Fig. 7.7 The result of the animation generation process. (Continued)

Fig. 7.7 The result of the animation generation process. (Continued)

7.2 Discussions

After presenting the experimental results, we would like to discuss a number of issues of concern, as follows.

The first issue is the video recording process. Since the transcript that contains all classes of Mandarin syllables is designed to consist of seventeen short and meaningful sentences, the model can read them easily. The model may also shake his/her head slightly during the process. The process takes about two minutes to complete, which is quite short and not annoying.

The second issue is the feature learning process. To learn the audio features, the audio data of the recorded video are first segmented into sentences. The proposed sentence segmentation algorithm is effective both in quiet environments and in environments with constant noise, such as that caused by fans. The result of syllable alignment may not be completely correct; however, it is still acceptable.

To learn the facial features, the position and rotation angle of the base region in every frame are detected. The learned information remains correct even when the face shakes slightly. Moreover, the process is automatic, so the tedious work of locating the positions manually is avoided. Normally, the process takes about half an hour to complete, which is a little lengthy but acceptable.
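To make the position and rotation angle learning more concrete, the sketch below estimates the in-plane rotation of a face from the two eye centers and de-rotates the frame. The eye coordinates are assumed to be given, for example by the eye-pair detection mentioned in Chapter 4, and the sketch is only an illustration of the idea, not the method of the system.

```python
import math
import cv2

def derotate_face(frame, left_eye, right_eye):
    """Estimate the in-plane rotation angle of the face from the two eye
    centers (pixel coordinates) and rotate the frame back so the eye line
    becomes horizontal."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = math.degrees(math.atan2(ry - ly, rx - lx))   # tilt of the eye line
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)    # rotating by `angle`
    h, w = frame.shape[:2]                               # levels the eye line
    return cv2.warpAffine(frame, rot, (w, h)), angle
```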

Finally, we discuss the animation generation process. In this process, the proper frames are generated. The experimental results show that the generated talking faces are natural. The faces also shake naturally due to the arrangement of the base images. This shows that 2D images are suitable for creating realistic talking faces with slight head-shaking actions.

Chapter 8

Conclusions and Suggestions for Future Works

8.1 Conclusions

In this study, a system for creating virtual talking faces has been implemented. The system is based on the use of 2D facial images. Techniques of image processing and speech recognition are utilized. Methods have been proposed to automate the learning process and to improve the quality of the generated animations.

The system is composed of three processes: video recording, feature learning, and animation generation. The video recording process is designed to be short, easy, and not annoying. A transcript that contains all classes of Mandarin syllables has been proposed, and a model reads the sentences on it instead of reading the syllables one by one. The sentences are designed to be short and meaningful, so that the model can read them without difficulty. Thanks to the feature learning process, the model is allowed to shake his/her head slightly, which yields more natural animation results.

In the feature learning process, audio features, facial features, and base image sequences are all learned automatically. The sentence segmentation algorithm proposed for learning the audio features is simple but robust in both quiet environments and environments with constant noise. The generation process of the base image sequences is designed to exhibit natural head-shaking actions. The learning process of the facial features is designed to handle shaking faces.

In the animation generation process, several methods have been proposed to improve the quality of the generated animations. A method has been proposed to reduce the synchronization error between the speech and the image frames to less than the duration of one frame. The number of transition frames required between successive syllables has been analyzed to smooth the transitions. The behaviors of a talking person and a singing person have also been analyzed, and a frame generation method suitable for creating both talking and singing faces has been proposed. Applications that utilize the proposed techniques, such as virtual announcers and virtual singers, have been implemented. Another application that integrates virtual talking faces into emails has also been described and implemented.
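The timing idea can be illustrated with the small worked example below, which assumes a fixed frame rate and maps syllable boundary times to the nearest frame indices; the concrete frame rate and the rounding scheme are illustrative assumptions rather than the exact method of the system.

```python
FPS = 30.0                      # assumed frame rate of the animation

def frame_for_time(t_seconds):
    """Map a syllable boundary time to the nearest frame index, so the
    synchronization error is at most half a frame period, i.e. 1 / (2 * FPS)."""
    return round(t_seconds * FPS)

def transition_frames(prev_end, next_start):
    """Number of frames available for the transition between two syllables."""
    return max(frame_for_time(next_start) - frame_for_time(prev_end), 0)

# Example: a syllable ending at 1.234 s followed by one starting at 1.384 s.
print(frame_for_time(1.234))            # 37  (error |37/30 - 1.234| ~ 0.0007 s)
print(transition_frames(1.234, 1.384))  # 5 frames available for the transition
```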

8.2 Suggestions for Future Works

Several suggestions to improve the proposed system are listed as follows.

(1) Reduction of the number of visemes --- The viseme information of 115 classes of Mandarin syllables is required to create animations containing arbitrary Mandarin syllables. However, the number of required visemes can still be reduced, because the mouth shapes of some of them are quite similar. Reducing the number of required visemes could dramatically shorten the learning time.

(2) Automatic detection of base regions on faces with glasses --- In Chapter 4, a knowledge-based face recognition method is utilized to locate the base region in the first frame. Since the method needs to locate the eye-pair on the face, it does not always work well on faces wearing glasses. If this problem can be solved, the feature learning process will become fully automatic even for faces wearing glasses.

(3) Integration of emotional expressions --- In this study, emotional expressions are not used because they affect many parts of the face. However, if emotional expressions can be integrated, the generated talking faces will become more interesting and natural.

(4) Integration of gestures and body actions --- The integration of gestures and body actions is also a topic worth studying. Talking faces with gestures and body actions are more lifelike. This task is easier than integrating emotional expressions, because it involves only the base images.

(5) Real-time animation generation --- In the animation generation process, a pre-recorded speech must be input and analyzed to obtain the timing information of the syllables in it. If the timing information could be learned in real time, animations synchronized with a speaking person could be generated in real time.

(6) Enhancement of smoothing between visemes --- The movement of the lips produced by the proposed system is still not natural enough. Possible enhancements, such as interpolation between successive frames, can be studied; a minimal cross-fade sketch is given after this list.

(7) Consideration of syllable collocations --- The viseme of a given syllable may vary when it collocates with different syllables, and this phenomenon should be considered to create more natural animations.

(8) Study of a freer video recording process --- In the proposed system, a model must read the entire transcript to create animations starring him/her. If the required features could be gathered from several fragments of videos starring the model, instead of from one complete video, the process would be freer and more convenient.
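As a starting point for suggestion (6), the following is a minimal cross-fade sketch that generates in-between frames by linear pixel interpolation; true viseme smoothing would likely require motion-aware morphing rather than a plain cross-fade.

```python
import numpy as np

def crossfade(frame_a, frame_b, n_inbetween=2):
    """Generate n_inbetween frames between frame_a and frame_b by linear
    pixel interpolation (a plain cross-fade, not a true morph)."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    frames = []
    for k in range(1, n_inbetween + 1):
        t = k / float(n_inbetween + 1)
        frames.append(((1.0 - t) * a + t * b).astype(frame_a.dtype))
    return frames
```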

