
(1)國立交通大學 多媒體工程研究所 碩士論文 以三維模式轉換技術作二維虛擬人臉之自動產生及其應用 Automatic 2D Virtual Face Generation by 3D Model Transformation Techniques and Applications. 研究生:張依帆 指導教授:蔡文祥 教授 中華民國九十六年六月.

(2) 以三維模式轉換技術作二維虛擬人臉之自動產生及其應用 Automatic 2D Virtual Face Generation by 3D Model Transformation Techniques and Applications. 研究生:張依帆 Student: Yi-Fan Chang. 指導教授:蔡文祥 Advisor: Prof. Wen-Hsiang Tsai. 國立交通大學多媒體工程研究所碩士論文. A Thesis Submitted to the Institute of Multimedia Engineering, College of Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Computer Science. June 2007, Hsinchu, Taiwan, Republic of China. 中華民國九十六年六月.

(3) 以三維模式轉換技術作二維虛擬人臉之 自動產生及其應用. 研究生:張依帆. 指導教授:蔡文祥 博士. 國立交通大學多媒體工程研究所. 摘要. 本論文提出了一套自動產生會說話的虛擬卡通臉系統。這個系統包含了四個 階段:卡通人臉產生、語音分析、臉部表情與嘴形合成、動畫製作。配合本論文 採用的人臉模型,系統會自動建構出一個三維人臉座標系統,並利用三維轉換技 術產生不同角度的二維卡通人臉。同時我們以部份特徵點作為控制點來控制卡通 人臉的表情,並藉由統計的方法來模擬說話時自然轉頭及不同表情的時間變化。 接著,藉由分析輸入的語音及相對應的文字稿件,我們將語音以句子的形式作切 割,再使用語音同步技術,配合提出的十二種基本嘴形來模擬會說話的卡通臉。 最後,藉由一可編輯且具有開放性之可擴展標記語言(XML),亦即 SVG,來達 成繪圖及語音同步輸出之效果。利用上述方法,我們實作出兩種有趣的應用。從 我們所獲得的良好實驗結果,證實了本論文所提出方法之可行性及應用性。. i.

(4) Automatic 2D Virtual Face Generation by 3D Model Transformation Techniques and Applications. Student: Yi-Fan Chang. Advisor: Prof. Wen-Hsiang Tsai, Ph.D. Institute of Multimedia Engineering, College of Computer Science National Chiao Tung University. ABSTRACT In this study, a system for automatic generation of talking cartoon faces is proposed, which includes four processes: cartoon face creation, speech analysis, facial expression and lip movement synthesis, and animation generation. A face model of 72 facial feature points is adopted. A method for construction of a 3D local coordinate system for the cartoon face is proposed, and a transformation between the global and the local coordinate systems is conducted by the use of a knowledge-based coordinate system transformation method. A 3D rotation technique is applied to the cartoon face model with some additional points to draw the face in different poses. A concept of assigning control points is applied to animate the cartoon face with different facial expressions. A statistical method is proposed to simulate the timing information of various facial expressions. For lip synchronization, a sentence utterance segmentation algorithm is proposed and a syllable alignment technique is applied. Twelve basic mouth shapes for Mandarin speaking are defined to synthesize lip movements. A frame interpolation method is utilized to generate the animation. Finally, an editable and open vector-based XML language, Scalable Vector Graphics (SVG), is used for rendering and synchronizing the cartoon face with speech. Two kinds of interesting applications are implemented. Good experimental results show the feasibility and applicability of the proposed methods. ii.

(5) ACKNOWLEDGEMENTS. I am in hearty appreciation of the continuous guidance, discussions, support, and encouragement received from my advisor, Dr. Wen-Hsiang Tsai, not only in the development of this thesis, but also in every aspect of my personal growth. Thanks are due to Mr. Chih-Jen Wu, Miss Kuan-Ting Chen, Mr. Kuan-Chieh Chen, Mr. Jian-Jhong Chen, Mr. Tsung-Chih Wang, and Mr. Shang-Huang Lai for their valuable discussions, suggestions, and encouragement. Appreciation is also given to the colleagues of the Computer Vision Laboratory in the Institute of Computer Science and Engineering at National Chiao Tung University for their suggestions and help during my thesis study. Finally, I also extend my profound thanks to my family for their lasting love, care, and encouragement. I dedicate this dissertation to my beloved parents.. iii.

(6) CONTENTS ABSTRACT(in Chinese)............................................................................ i ABSTRACT(in English)............................................................................ ii ACKNOWLEDGEMENTS...................................................................... iii CONTENTS.............................................................................................. iv LIST OF FIGURES ................................................................................. vii LIST OF TABLES ......................................................................................x. Chapter 1. Introduction ............................................................................1. 1.1 1.2 1.3. Motivation..............................................................................................1 Survey of Related Studies ......................................................................3 Overview of Proposed Method ..............................................................5 1.3.1 Definitions of Terms ......................................................................5 1.3.2 Assumptions...................................................................................7 1.3.3 Brief Descriptions of Proposed Method ........................................7 1.4 Contributions..........................................................................................9 1.5 Thesis Organization ...............................................................................9. Chapter 2. Cartoon Face Generation and Modeling from Single Images . ..............................................................................................11. 2.1 2.2 2.3. Introduction.......................................................................................... 11 Review of Adopted Cartoon Face Model.............................................12 Construction of 3D Face Model Based on 2D Cartoon Face Model ...15 2.3.1 Basic Idea.....................................................................................15 2.3.2 Construction Process....................................................................16 2.4 Creation of Cartoon Face .....................................................................21 2.4.1 Creation of Frontal Cartoon Face ................................................21 2.4.2 Generation of Basic Facial Expressions ......................................24 2.4.3 Creation of Oblique Cartoon Face ...............................................26 2.5 Experimental Results ...........................................................................32. Chapter 3 3.1 3.2. Speech Segmentation for Lip Synchronization....................34 Introduction to Lip Synchronization for Talking Cartoon Faces .........34 Segmentation of Sentence Utterances by Silence Feature...................36 iv.

(7) 3.2.1 3.2.2. Review of Adopted Segmentation Method ..................................36 Segmentation Process ..................................................................37 3.3 Mandarin Syllable Segmentation.........................................................40 3.3.1 Review of Adopted Method .........................................................40 3.3.2 Segmentation Process ..................................................................41 3.4 Experimental Results ...........................................................................41. Chapter 4. Animation of Facial Expressions .........................................44. 4.1 4.2. Introduction..........................................................................................44 Analysis of Facial Expression Data from Images of TV News Announcers ..........................................................................................45 4.3 Review of Adopted Simulation Methods of Eye Blinks and Eyebrow Movements...........................................................................................47 4.4 Simulation of Eyebrow Movements ....................................................50 4.5 Simulation of Head Tilting and Turning ..............................................52 4.5.1 Simulation of Head Tilting...........................................................52 4.5.2 Simulation of Horizontal Head Turning ......................................54 4.5.3 Simulation of Vertical Head Turning ...........................................54. Chapter 5. Talking Cartoon Face Generation ........................................60. 5.1 5.2. Introduction..........................................................................................60 Definitions of Basic Mouth Shapes .....................................................61 5.2.1 Review of Definition of Basic Mouth Shapes .............................63 5.2.2 New Definitions of Basic Mouth Shapes.....................................67 5.3 Review of Adopted Time Intervals of Sentence Syllables for Mouth Shape Generation .................................................................................73 5.4 Talking Cartoon Face Generation by Synthesizing Lip Movements ...74 5.5 Experimental Results ...........................................................................75. Chapter 6. Talking Cartoon Face Generator Using Scalable Vector Graphics ...............................................................................78. 6.1 6.2 6.3. Introduction..........................................................................................78 Overview of Scalable Vector Graphics (SVG) ....................................78 Construction of a Talking Cartoon Face Generator Using SVG..........80 6.3.1 Spatial Domain Process ...............................................................80 6.3.2 Temporal Domain Process ...........................................................84 6.4 Experimental Results ...........................................................................85. v.

(8) Chapter 7. Applications of Talking Cartoon Faces................................88. 7.1 7.2. Introduction to Implemented Applications ..........................................88 Application to Virtual Announcers ......................................................89 7.2.1 Introduction to Virtual Announcers..............................................89 7.2.2 Process of Talking Face Creation.................................................89 7.2.3 Experimental Results ...................................................................90 7.3 Applications to Audio Books for E-Learning ......................................91 7.3.1 Introduction to Audio Books........................................................91 7.3.2 Process of Audio Book Generation..............................................92 7.3.3 Experimental Results ...................................................................93. Chapter 8 8.1 8.2. Conclusions and Suggestions for Future Works ..................94 Conclusions..........................................................................................94 Suggestions for Future Works..............................................................95. References 98. vi.

(9) LIST OF FIGURES Figure 1.1 Figure 2.1 Figure 2.2 Figure 2.3 Figure 2.4 Figure 2.5. Figure 2.6 Figure 2.7 Figure 2.8 Figure 2.9. Figure 2.10 Figure 2.11 Figure 2.12 Figure 2.13 Figure 2.14 Figure 2.15 Figure 2.16 Figure 2.17. Configuration of proposed system. ...........................................................8 Flowchart of hierarchical bi-level thresholding method in Chen and Tsai [1]............................................................................................................13 A face model. (a) Proposed 72 feature points. (b) Proposed facial animation parameter units in Chen and Tsai [1]. ....................................14 An illustration of corner-cutting algorithm in Chen and Tsai [1]. ..........14 Cubic Bezier curve in Chen and Tsai [1]. ...............................................15 Two coordinate systems. The lines drawn in black color represent the global coordinate system, and those drawn in red color represent the local one...........................................................................................................17 Points to help drawing. ...........................................................................17 Two orthogonal photographs. (a) Front view. (b) Side view. .................20 An illustration of arc(P1, …, Pn).............................................................22 An illustration of the steps in the creation of the frontal cartoon face. (a) The creation of the contour of a face. (b) The creation of the ears. (c) The creation of the nose. (d) The creation of the eyebrows. (e) The creation of the eyes. (f) The creation of the mouth. ..................................................24 An experimental result of the creation of a frontal cartoon face. (a) A male face model. (b) A female face model. .....................................................24 An experimental result of generation of an eye blinking effect..............25 An experimental result of generation of a smiling effect........................26 An experimental result of generation of an eyebrow raising effect........26 An illustration of a point rotated on the three Cartesian axes.................27 An illustration of a point rotated on the Y axis.......................................28 An illustration of the focus and eyeballs. (a) Before rotation. (b) After rotation. ...................................................................................................29 An illustration of the unreality of the hair contour. ................................30. Figure 2.18 An illustration of the shift of hair contour points. (a) Before rotation. (b) After rotation...........................................................................................31 Figure 2.19 An illustration of creation of oblique cartoon faces. (a) An oblique cartoon face with β = 15 degrees. (b) An oblique cartoon face with β = −15 degrees. .................................................................................................................32 Figure 2.20 An example of experimental results for creation of cartoon faces in different poses with different facial expressions.....................................33 Figure 3.1 A flowchart of proposed speech analysis process...................................35 vii.

(10) Figure 3.2 Figure 3.3 Figure 3.4 Figure 3.5 Figure 3.6 Figure 3.7. An example of recorded video contents and corresponding actions in Lai and Tsai [4]..............................................................................................36 A flowchart of the sentence utterance segmentation process..................38 An example of selecting the first silent part in an input audio. ..............40 An example of sentence utterances segmentation results. The blue and green parts represent odd and even speaking parts, respectively............40 A flowchart of the Mandarin syllable segmentation process. .................42 An example of entire audio data of a transcript. The content of the transcript is “或許你已聽過,多補充抗氧化劑可以延緩老化。但真相 為何?”...................................................................................................42. Figure 3.8 Figure 3.9. The result of syllable alignment of the audio in Figure 3.7. ...................42 An example of entire audio data of a transcript. The content of the transcript is “長期下來,傷害不斷累積,就可能造就出一個較老、較 脆弱的身體。”.......................................................................................43. Figure 3.10 Figure 4.1 Figure 4.2 Figure 4.3 Figure 4.4. The result of syllable alignment of the audio in Figure 3.9. ...................43 An illustration of the definitions of ts and te for eyebrow movements. ..45 An illustration of the definitions of ts and te for head movements..........46 A screen shot of the software VirtualDub. ..............................................46 An illustration of the probability function of eye blinks in Lin and Tsai [3]. The one in blue color is the probability function of the Gamma distribution with α = 2 and θ = 1.48. The other one in pink color is the probability function of eye blinks approximated from the analysis data of TV News announcers. .............................................................................49 An illustration of basic components for definition of basic mouth shapes. (a) Control points of the mouth. (b) FAPUs of the mouth and the nose. 64 An illustration of basic mouth shapes of Mandarin initials in Chen and Tsai [1]. (a) Basic mouth shape m. (b) Basic mouth shape f. (c) Basic mouth shape h. ........................................................................................65 An illustration of basic mouth shapes of the Mandarin finals in Chen and Tsai [1]. (a) Basic mouth shape a. (b) Basic mouth shape i. (c) Basic mouth shape u. (d) Basic mouth shape e. (e) Basic mouth shape o. (f) Basic mouth shape n. ..............................................................................67 An illustration of basic mouth shapes of Mandarin initials. (a) Basic mouth shape m. (b) Basic mouth shape f. (c) Basic mouth shape h’. (d) Basic mouth shape h. (e) Basic mouth shape r. (f) Basic mouth shape z. .................................................................................................................72 An illustration of basic mouth shapes of Mandarin finals. (a) Basic mouth shape a. (b) Basic mouth shape i. (c) Basic mouth shape u. (d) Basic. Figure 5.1 Figure 5.2. Figure 5.3. Figure 5.4. Figure 5.5. viii.

(11) mouth shape e. (e) Basic mouth shape o. (f) Basic mouth shape n.........72 Figure 5.6 An illustration of time intervals of a syllable of two basic mouth shapes in [1]............................................................................................................74 Figure 5.7 An illustration of time intervals of a syllable of three basic mouth shapes in [1]........................................................................................................74 Figure 5.8 An illustration of time intervals of a syllable of four basic mouth shapes in [1]............................................................................................................74 Figure 5.9 A concept of the use of key frames.........................................................75 Figure 5.10 An overall illustration of the process of talking cartoon face generation. .................................................................................................................75 Figure 5.11 An experimental result of the talking cartoon face speaking “願望.” ....76 Figure 5.12 An experimental result of the talking cartoon face speaking “波濤.” ....77 Figure 6.1 Figure 6.2 Figure 6.3 Figure 6.4 Figure 6.5 Figure 6.6. A result of an SVG source code..............................................................80 A concept of layers in the special domain. .............................................81 An example of the syntax polyline of SVG.............................................81 An example of the syntax circle of SVG.................................................82 An illustration of eye drawing. ...............................................................82 An example of using the syntax path of SVG to draw the cubic Bezier curve........................................................................................................83 Figure 6.7 An example of adding two layers of the background and the clothes for the cartoon face.............................................................................................83 Figure 6.8 Another example of adding two layers of the background and the clothes for the cartoon face. ................................................................................84 Figure 6.9 An experimental result of the talking cartoon face speaking “蜿蜒.” ....86 Figure 6.10 An experimental result of the talking cartoon face speaking “光明.” ....87 Figure 7.1 Figure 7.2 Figure 7.3 Figure 7.4. An illustration of the process of proposed system. .................................90 An example of a virtual announcer. ........................................................91 Another example of a virtual announcer.................................................91 An example of a virtual teacher. .............................................................93. ix.

(12) LIST OF TABLES Table 2.1 Table 3.1 Table 4.1 Table 4.2 Table 4.3 Table 4.4 Table 4.5 Table 4.6 Table 4.7. The values of the points in the z-direction.................................................20 Descriptions of audio features. ..................................................................35 Statistics of time intervals of eyebrow movements. ..................................51 Statistics of durations of eyebrow movements. .........................................52 Statistics of time intervals of head tilting. .................................................53 Statistics of durations of head tilting. ........................................................54 Statistics of time intervals of horizontal head turning. ..............................55 Statistics of durations of horizontal head turning. .....................................56 The mean values, the standard deviation values, and the adopted intervals of uniform random variables for simulation of horizontal head turning........56 Table 4.8 Some examples of the relation between the vertical head turning and the pause time. .................................................................................................57 Table 4.9 Statistics of time intervals between the nod and the pause. .......................57 Table 4.10 Statistics of durations of the nod................................................................58 Table 4.11 Statistics of durations of the head raising after the nod. ............................58 Table 4.12 The mean value, the standard deviation value, and the adopted intervals of uniform random variables for simulation of vertical head turning. ...........59 Table 5.1 Classification of initials according to the manners of articulation proposed in Yeh [15]..................................................................................................61 Table 5.2 An illustration of 7 kinds of mouth shapes of initials proposed in Yeh [15]. ....................................................................................................................62 Table 5.3 Three basic mouth shapes of Mandarin initials in Chen and Tsai [1]........63 Table 5.4 A set of combinations with 7 basic mouth shapes of Mandarin finals in Chen and Tsai [1].................................................................................................64 Table 5.5 Five basic mouth shapes of Mandarin initials............................................68 Table 5.6 A set of combinations with 7 basic mouth shapes of Mandarin finals.......68. x.

(13) Chapter 1 Introduction. 1.1 Motivation With the improvement of network and multimedia technologies, people are changing the way they use computers. They have become used to handling their work on computers and obtaining information through the Internet. Therefore, different types of multimedia, including text, image, audio, and video, have evolved. People can now read news and acquire new knowledge in a great diversity of forms on the Internet. However, the variety of multimedia types does not by itself make computers friendlier to interact with; users remain unsatisfied because computers still lack human nature. For this reason, more and more researchers devote themselves to improving the interaction between humans and computers. Several researchers report that the use of virtual talking faces, which are animations of human faces on computer screens, can increase the attention paid by users. Users are not only impressed by the virtual face, and hence keep the relevant information in mind, but also enjoy interacting with the computer. Although the topic of virtual talking faces has been studied for many years, generating realistic talking faces is still a challenging task because face movements are quite complicated to simulate, especially the deformation of muscles. Achieving such realism requires motion capture equipment, which is too expensive for common users. For some applications, 1.

(14) realism is not an essential property. Instead, some people are more concerned about how to make the talking face lively enough to express their words in a natural way. For this reason, we use a 3D cartoon-like virtual face model which can display proper lip movements synchronized with the speech, lifelike head movements, and emotional expressions. The style of non-photorealistic cartoon faces can be designed more freely than that of photorealistic ones. Shapes and textures of cartoon faces can be represented simply, so it is unnecessary to calculate the complex deformation of muscles. Also, the data size of the face model may be reduced because of the simpler representation of the cartoon face. Expressions can be easily modified by relocating the positions of predefined feature points, so cartoon face models offer more variety and fun for displaying personalized faces than photorealistic ones. However, generating a three-minute animation requires at least 4320 frames (three minutes at 24 frames per second). To animate the cartoon face without dealing with it frame by frame, we must apply some methodology to generate the animation automatically. One method to generate the animation of virtual faces automatically is to use the technique of real-time motion capture, which has been developed for many years. By putting some markers on one's face, tracking them continuously by sensors, and then mapping the tracked markers to the virtual face, we can extract facial features and create corresponding facial expressions on the virtual face. As mentioned above, this approach is too expensive and complicated for ordinary users. To animate virtual faces realistically but with less effort, we want to design a method to generate a virtual cartoon face speaking Mandarin, which requires just existing cartoon face models and segments of speech as input. We hope that we can achieve our goal by utilizing the techniques of 3D coordinate generation and 2.

(15) transformation, speech processing and statistical simulation, as well as creation of basic emotions and head movements. Based on the research result of this study, it is also hoped that the generation of virtual announcers, virtual teachers, virtual storytellers, and so on, may become easier and more convenient for use in various applications.. 1.2 Survey of Related Studies Roughly speaking, there are two main phases to generate virtual talking faces. One is the creation of head models, including 2D and 3D models. The other is the generation of virtual face animations with speech synchronization. For the first phase, the main issue is how to create a face model. In Chen and Tsai [1], cartoon face models are created from single front-view facial images by extraction of facial features points, and cartoon faces are generated by curve drawing techniques. In Zhang and Cohan [5], multiple image views of a particular face are utilized to morph the generic 3D face model into specific face structures. In Goto, Kshirsagar, and Thalmann [6], the method of automatic face cloning using two orthogonal photographs was proposed. It includes two steps: face model matching and texture generation. After these two steps are performed, a generic 3D face model is deformed to fit to the photographs. Zhang et al. [12] advanced a practical approach that also needs only two orthogonal photos for fast 3D modeling. They used radial basis functions (RBF) to deform a generic model with corresponding feature points and then performed texture mapping for realistic modeling. Chen et al. [7, 8] proposed a method to automatically generate a facial sketch of human portraits from input images. They used the non-parametric sampling method to learn the drawing 3.

(16) styles in the sketches illustrated by an artist. According to the learned styles, they can fit a flexible sketch model to the input images and then generate the corresponding sketches by an example-based method. The non-parametric sampling scheme and the example-based method are also adopted in the cartoon system PicToon, which is designed in [9]. The system can be used to generate a personalized cartoon face from an input image. By sketch generation and stroke rendering techniques, a stylized cartoon face is created. In order to animate a virtual talking face, speech synchronization is an important issue to be concerned about. In [1] and [9], cartoon faces are animated by an audio-visual mapping between input speeches and the corresponding lip configuration. In Li et al. [10], cartoon faces are animated not only from input speeches, but also based on emotions derived from speech signals. In [3] and [4], methods of animating a photorealistic virtual face were studied. A frame generation algorithm was used for audio synchronization to generate a talking face. Another approach to generating virtual talking face animations is to track the facial features in real-time and map the feature points to the control points on the face model. A method for real-time tracking was proposed in [11] by putting some markers on faces. To track facial features without markers, some image processing techniques are required. A system, designed in [6], can track many facial features in real-time, like eye, eyebrow, mouth, and jaw. Chen and Tsai [2] also proposed a method of eye-pair tracking, mouth tracking, and detection of head turning for real-time facial images. They designed a real-time virtual face animation system, which is combined with networks, to implement an application to multi-role avatar broadcasting and an application to web TV by ActiveX technique.. 4.

(17) 1.3 Overview of Proposed Method An overview of the proposed approach is described in this section. First, some definitions of terms used in this study are described in Section 1.3.1. And some assumptions made for this study are listed in Section 1.3.2. Finally a brief description of the proposed method is outlined in Section 1.3.3.. 1.3.1 Definitions of Terms The definitions of some terms used in this study are listed as follows. (1) Neutral Face: MPEG-4 specifies some conditions for a head in its neutral state [13] as follows. 1.. Gaze is in the direction of the Z-axis.. 2.. All face muscles are relaxed.. 3.. Eyelids are tangent to the iris.. 4.. The pupil is one third of the iris diameter.. 5.. The lips are in contact.. 6.. The line of the lips is horizontal and at the same height of lip corners.. 7.. The mouth is closed and the upper teeth touch the lower ones.. 8.. The tongue is flat and horizontal with the tip of the tongue touching the boundary between the upper and lower teeth.. In this thesis, a face with a normal expression is called a neutral face. (2) Neutral Facial Image: A neutral facial image is an image with a frontal and straight neutral face in it. (3) Facial Features: In the proposed system, we care about several features of the face, including hair, face, eyebrows, eyes, nose, mouth, and ears of each facial image. 5.

(18) (4) Facial Action Units (FAUs): The Facial Action Coding System (FACS) [14] defines 66 basic Facial Action Units (FAUs). The major part of the FAUs represents primary movements of facial muscles in action, such as raising eyebrows, blinking, and talking. Other FAUs represent head and eye movements. (5) Facial Expression: A facial expression is a facial aspect representative of feeling. Here, facial expressions include emotions and lip movements. Facial expressions can be described as combinations of FAUs. (6) FAPUs: Facial animation parameter units (FAPUs) are fractions of the distances between some facial features, like the eye separation, the mouth width, and so on. (7) Face Model: A face model is a 2D cartoon face model with 74 feature points and some FAPUs, covering the hair, eyebrows, eyes, nose, mouth, and so on. (8) Face Model Control Points: These points are some of the 74 feature points of the face model. They are used to control many features of the face model, like eyebrow raising, eye opening, lip movement, head tilting, and head turning. (9) Phoneme: A phoneme is a basic enunciation of a language. For example, ㄉ, ㄨ, ㄞ are phonemes in Mandarin. (10) Syllable: A syllable consists of phonemes. For example, ㄊㄢ, ㄍㄣ are syllables in Mandarin. (11) Viseme: A viseme is the visual counterpart of a phoneme. (12) Speech Analyzer: A speech analyzer receives a speech file and a script file as input, and applies speech recognition techniques to get the timing information of each syllable. (13) Key Frame: A key frame is a frame at a specified time in the animation at which the look of the cartoon face is exactly specified. 6.

(19) (14) Hidden Markov Model (HMM): The HMM is used to characterize the spectral properties of the frames of a speech pattern. (15) Transcript: A transcript is a text file that contains the corresponding content of a speech. 1.3.2 Assumptions In real situations, it is not easy to simulate a 3D rotation when the depth information is lacking. Because only 2D cartoon face models and audio data are required as input to the proposed system, inappropriate properties of these data may influence the result. For example, noise in the audio may affect the result of syllable segmentation, and an unusual distribution of the facial features will cause exceptions in the 3D coordinate generation process. In order to reduce the complexity of the processing work in the proposed system, a few assumptions and restrictions are made in this study. They are described as follows. (1) The recording environment of the input audio is noiseless. (2) The speech is spoken at a steady speed and in a loud voice. (3) The face of the model always faces the camera, with the rotation angle of the face about each of the three Cartesian axes not exceeding ±15° when naturally speaking. (4) The face of the model has smooth facial features. 1.3.3 Brief Descriptions of Proposed Method In the proposed system, four major parts are included: a cartoon face creator, a speech analyzer, an animation editor, and an animation and webpage generator. The cartoon face creator creates cartoon faces from neutral facial images or cartoon face models. The speech analyzer segments the speech file into sentences and then performs the speech-text alignment for lip synchronization. The animation editor 7.

(20) allows users to edit facial actions such as head movements and eyebrows raises in the animation. The animation and webpage generator renders cartoon faces and generates webpages with embedded animation. A configuration of proposed system is shown in Figure 1.1.. Figure 1.1 Configuration of proposed system.. We use a web camera to capture a single neutral facial image, and then utilize the proposed cartoon face creator to create a personal cartoon face. The positions of feature points and the values of some FAPUs can be saved as a face model, which can be loaded as input by the cartoon face creator to create a personal cartoon face, too. After the personal cartoon face is created, users may use a speech file and a script file as inputs of the speech analyzer, which gets the timing information of each syllable in the speech file. Then the animation editor automatically synthesizes lip movements according to the timing information. Users may specify facial actions or generate them automatically by the animation editor. Finally, the animation and webpage generator will output an animation file of a talking cartoon face and a webpage file. 8.

(21) with the animation embedded in it.. 1.4 Contributions Some major contributions of the study are listed as follows. (1) A complete system for automatically creating personal talking cartoon faces is proposed. (2) A method for construction of 3D cartoon face models based on 2D cartoon face models is proposed. (3) A method for simulation of head tilting and turning using 3D rotation techniques is proposed. (4) Some methods for automatically gathering audio features for speech segmentation are proposed. (5) A method for simulation of the probabilistic head movements and basic emotions is proposed. (6) Several new applications are proposed and implemented by using the proposed system.. 1.5 Thesis Organization The remainder of the thesis is organized as follows. In Chapter 2, the proposed method of construction of 3D cartoon face models based on 2D cartoon face models and a method of creation of virtual cartoon faces are described. In Chapter 3, the proposed method of speech segmentation for lip synchronization is described. In 9.

(22) Chapter 4, the proposed method of simulating facial expressions and head movements is described. And then, some animation issues such as lip movements and smoothing of talking cartoon facial animation are discussed and solved in Chapter 5. Up to Chapter 5, talking cartoon faces are generated. A final integration using an open standard language SVG (Scalable Vector Graphics) to generate web-based animations is described in Chapter 6. Some examples of applications using the proposed system are presented in Chapter 7. Finally, conclusions and some suggestions for future works are included in Chapter 8.. 10.

(23) Chapter 2 Cartoon Face Generation and Modeling from Single Images. 2.1 Introduction To make an animated cartoon face livelier, many issues are of great concern to us, such as lip movements, eye blinks, eyebrow movements, and head movements. Especially for the simulation of head movements, including head tilting and head turning, a 2D face model is not enough to synthesize proper head poses of cartoon faces. For this reason, a 3D face model must be constructed to handle this problem. In the proposed system, one of the four major parts shown in Figure 1.1, named the cartoon face creator, is designed to create personal cartoon faces, integrating the technique of 3D face model construction. In the creation process, three main steps are included. The first step is to assign facial feature points to a 2D face model. It can be done in two ways. One is to detect facial features of an input neutral facial image, generate corresponding feature points, and map them to the feature points in the predefined face model. The other is to directly assign the feature points according to the input 2D face data. In this study, we adopt both ways in constructing our face model. The second step is to construct the local coordinate system of the face model for applying 3D rotation techniques. By creating a transformation between the global and the local coordinate systems and assigning the position of the feature points in the 11.

(24) third dimension, namely, the Cartesian z-coordinate, this step can be done, and then essential head movements can be simulated. The last step is to define basic facial expression parameters for use in face animation. In this chapter, some techniques are proposed to achieve the purpose mentioned above. First, a review of Chen and Tsai [1] about constructing a 2D face model from single images is presented in Section 2.2. Second, a technique to construct a 3D face model based on the 2D face model is proposed in Section 2.3. In Section 2.4, a technique is proposed to create the cartoon face with different expressions in different poses.. 2.2 Review of Adopted Cartoon Face Model Chen and Tsai [1] proposed an automatic method for generation of personal cartoon faces from a neutral facial image. In their method, three main steps are carried out: extraction of facial feature regions, extraction of facial feature points, and creation of a face model. In the first step, a hierarchical bi-level thresholding method is used to extract the background, hair, and face regions in a given face image. A flowchart of the hierarchical bi-level thresholding method is shown in Figure 2.1. Then, by finding all probable pairs of eye regions according to a set of rules related to the region’s heights, widths, etc., and filtering these regions according to the symmetry of the two regions in each pair, an optimal eye-pair can be detected. Taking the positions of the detected optimal eye-pair as a reference, the facial feature regions can be extracted by a knowledge-based edge detection technique.. 12.
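To make the adopted region-extraction step more concrete, the following is a minimal sketch of one plausible reading of a two-level (hierarchical) bi-level thresholding pipeline. It assumes Otsu's criterion at each level and a background brighter than the head region; the actual threshold-selection rules and region heuristics in Chen and Tsai [1] may differ.

```python
import numpy as np

def otsu_threshold(values):
    # Otsu's bi-level threshold: maximize the between-class variance
    # over a 256-bin histogram of 8-bit gray values.
    hist = np.bincount(values.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                      # class-0 probability
    mu = np.cumsum(prob * np.arange(256))        # class-0 cumulative mean
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(sigma_b))

def segment_face_regions(gray):
    # Level 1: split the image into (bright) background and the head region.
    t1 = otsu_threshold(gray)
    head_mask = gray < t1                        # assumption: background is brighter
    # Level 2: within the head region, split dark hair from brighter skin.
    t2 = otsu_threshold(gray[head_mask])
    hair_mask = head_mask & (gray < t2)
    face_mask = head_mask & (gray >= t2)
    return head_mask, hair_mask, face_mask
```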

(25) Figure 2.1 Flowchart of hierarchical bi-level thresholding method in Chen and Tsai [1]. Before extracting facial feature points, a face model with facial feature points must be defined first. Because the 84 feature points and the facial animation parameter units (FAPUs) of the face model specified in the MPEG-4 standard are not suitable for cartoon face drawing, Chen and Tsai [1] defined a face model with 72 feature points by adding or eliminating some feature points of the face model in MPEG-4. Some FAPUs were also specified according to the MPEG-4 standard, and then a new adoptable face model was set up. An illustration of the proposed face model is shown in Figure 2.2. In order to control the facial expression of the cartoon face, some feature points were assigned to be control points, which are listed as follows: 1. Eyebrow Control Points: there are 8 control points in the two eyebrows, namely, 4.2, 4.4, 4.4a, 4.6, 4.1, 4.3, 4.3a, and 4.5. 2. Eye Control Points: there are 4 control points in the eyes, namely, 3.1, 3.3, 3.2, and 3.4. 3. Mouth Control Points: there are 4 control points in the mouth, namely, 8.9, 8.4, 8.3, and 8.2, by which other mouth feature points are computed. 4. Jaw Control Point: there is 1 control point in the jaw, namely, 2.1, which is automatically computed from the position of the control point 8.2 and the value of the facial animation parameter JawH. 13.
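As a concrete illustration of how such a face model and its control points might be organized in code, here is a minimal sketch; Python is used for illustration only, and the field names and the JawH derivation are assumptions, not the data layout actually used in [1] or in this system.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Point2D = Tuple[float, float]

# Control-point groups as listed above; point names follow the model's numbering.
EYEBROW_CONTROLS = ("4.2", "4.4", "4.4a", "4.6", "4.1", "4.3", "4.3a", "4.5")
EYE_CONTROLS = ("3.1", "3.3", "3.2", "3.4")
MOUTH_CONTROLS = ("8.9", "8.4", "8.3", "8.2")

@dataclass
class FaceModel:
    points: Dict[str, Point2D] = field(default_factory=dict)  # e.g. {"8.2": (x, y), ...}
    fapus: Dict[str, float] = field(default_factory=dict)     # e.g. {"d": ..., "JawH": ..., "EyebrowH": ...}

    def jaw_control_point(self) -> Point2D:
        # One plausible derivation of jaw point 2.1 from mouth point 8.2 and JawH
        # (image y grows downward, so the jaw lies below the lower lip).
        x, y = self.points["8.2"]
        return (x, y + self.fapus["JawH"])
```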

(26) These control points in this study are the so-called face model control points. After setting up the face model with 72 feature points, the corresponding feature points in a given facial image can be extracted from the previously mentioned facial feature regions.. (a). (b). Figure 2.2 A face model. (a) Proposed 72 feature points. (b) Proposed facial animation parameter units in Chen and Tsai [1].. Finally, two curve drawing methods are applied to create cartoon faces. One is the corner-cutting subdivision method, in which a subdivision curve is generated by repeatedly cutting off corners of a polygon until a certain condition is reached, as shown in Figure 2.3. The other is the cubic Bezier curve approximation method, which is used to produce smooth curves with a simple polynomial equation, as shown in Figure 2.4.. Figure 2.3 An illustration of corner-cutting algorithm in Chen and Tsai [1]. 14.
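For reference, the sketch below shows one common corner-cutting subdivision scheme (Chaikin's algorithm is assumed here; the exact cutting ratios and stopping condition used in [1] may differ) together with the standard cubic Bezier evaluation formula.

```python
import numpy as np

def corner_cutting(polygon, iterations=3):
    # Chaikin-style corner cutting on a closed polygon: every edge P_i P_{i+1}
    # is replaced by the points 1/4 and 3/4 of the way along it, so corners
    # are repeatedly cut off and the polygon converges to a smooth curve.
    pts = np.asarray(polygon, dtype=float)
    for _ in range(iterations):
        nxt = np.roll(pts, -1, axis=0)           # the following vertex of each edge
        q = 0.75 * pts + 0.25 * nxt
        r = 0.25 * pts + 0.75 * nxt
        pts = np.empty((2 * len(pts), 2))
        pts[0::2], pts[1::2] = q, r
    return pts

def cubic_bezier(p0, p1, p2, p3, n=32):
    # Sample B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3.
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 + \
           3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3
```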

(27) Figure 2.4 Cubic Bezier curve in Chen and Tsai [1].. 2.3 Construction of 3D Face Model Based on 2D Cartoon Face Model In this section, the basic idea of constructing a 3D face model based on the above-mentioned 2D cartoon face model is introduced in Section 2.3.1. And the detail of the construction process is described in Section 2.3.2.. 2.3.1 Basic Idea Based on the face model mentioned in Section 2.2, a method is proposed to construct a 3D face model. The method can be divided into two steps: the first is to construct a local coordinate system from the global one, and the second is to assign the position of the feature points in the Cartesian z-direction. The basic idea for constructing a local coordinate system is to define a rotation origin and transform the points of the global coordinate system into those of the local one based on the rotation origin. The basic idea for assigning the position of the feature points in the Cartesian z-direction is to do the assignment based on a proposed generic model. Although a 15.

(28) generic model cannot represent all cases of human faces, it is practical enough in the application of generating virtual talking faces, because in real cases one usually does not roll his/her head violently when giving a speech. With the assumption that heads are rotated only slightly when speaking, a little inaccuracy of the depth information in a face model would not affect the result much, so we can easily generate different head poses of the face model by a 3D rotation technique after the 3D face model is constructed. 2.3.2 Construction Process The first step to construct a 3D face model is to construct a local coordinate system. As mentioned above, to achieve this goal, the first issue is to define a rotation origin. The ideal position of the rotation origin is the center of the neck, so we propose a knowledge-based method to define its position according to the position of the eyes. Some definitions of the terms used in this section are listed first as follows: EyeballLeft.x / EyeballRight.x is the x-position of the center of the left/right eyeball circle; EyeballLeft.y / EyeballRight.y is the y-position of the center of the left/right eyeball circle; EyeMid is the position (x, y) of the center between EyeballLeft and EyeballRight. The green dot shown in Figure 2.5 represents the point EyeMid. After computing the position of EyeMid and making use of the FAPU d in the face model, which denotes the Euclidean distance between the two eyeballs, we can set the rotation origin and create a transformation between the global coordinate system and the local one, as shown in Figure 2.5. However, before we start the transformation, some additional points (as shown in Figure 2.6) must be defined to help drawing the cartoon face in different poses, as we 16.

(29) will describe later in Section 2.4. These points will also be transformed into the local coordinate system, so we must set up their positions before the transformation is started. The detailed method for setting up the additional points and conducting the transformation is expressed as an algorithm in the following. Figure 2.5 Two coordinate systems. The lines drawn in black color represent the global coordinate system, and those drawn in red color represent the local one. Figure 2.6 Points to help drawing. 17.

(30) Algorithm 2.1. Knowledge-based coordinate system transformation. Input: One point EyeMid, some FAPUs, including d, EyebrowH, and EyebrowH2, and 72 model points in the global coordinate system. Output: A rotation origin O, 17 additional points, and 72 model points in the local coordinate system. Steps: 1. Let Wear denote the distance between EyeMid and an ear in the face model. 2. Let xp and yp denote the x-position and the y-position of a point P in the face model. 3. Speculate a rotation origin O(xo, yo) to represent the center of the neck with xo = EyeMid.x; yo = EyeMid.y + d × 1.3. 4. Set the additional points A(xa, ya), B(xb, yb), C(xc, yc), D(xd, yd), E(xe, ye), F(xf, yf), G(xg, yg), H(xh, yh), I(xi, yi), J(xj, yj), K(xk, yk), L(xl, yl), M(xm, ym), N(xn, yn), Q(xq, yq), R(xr, yr), and S(xs, ys) in the following way: xa = EyeMid.x − Wear, ya = (4 × y10.2 + y10.8)/5; xb = EyeMid.x + Wear, yb = (4 × y10.1 + y10.7)/5; xc = (x9.13 + x9.14)/2, yc = (y9.14 + EyeMid.y)/2; xd = (5 × x9.4 + x9.5)/6, yd = y9.4 − (y9.4 − y9.14)/3; xe = (5 × x9.5 + x9.4)/6, ye = y9.5 − (y9.5 − y9.13)/3; xf = x4.4a, yf = y4.4a + EyebrowH2/1.3; xg = x4.4, yg = y4.4 + EyebrowH/1.3; xh = x4.3, yh = y4.3 + EyebrowH/1.3; 18.

(31) xi = x4.3a, yi = y4.3a + EyebrowH2/1.3; xj = (5 × x2.14 + x2.1)/6, yj = y2.14; xk = (5 × x2.13 + x2.1)/6, yk = y2.13; xl = (x3.12 + x10.10)/2, yl = y10.10; xm = (x3.7 + x10.9)/2, ym = y10.9; xn = x11.2, yn = y11.2; xq = x11.3, yq = y11.3; xr = x8.4, yr = y8.4; xs = x8.5, ys = y8.5. 5. For each of the additional points and the 72 model points P(xp, yp), set the point (xp, yp) in the following way: xp = xp − xo; yp = yo − yp. The second step to construct a 3D face model is to assign the position of the points in the Cartesian z-direction. A generic model is proposed as the reference for the assignment. To generate the generic model, two orthogonal photographs are used, as shown in Figure 2.7. By calculating the Euclidean distance d between two eyeballs and the distance d’ between the y-position of EyeMid and the y-position of the feature point 2.2 in the front-view image, d’ can be expressed as a constant multiple of d. Here it is shown as 1.03d in the experiment in Figure 2.7(a). Similarly in the side-view image, the distance between EyeMid and the point 2.2 in the y-direction is set to the constant multiple of d, as shown in Figure 2.7(b). By marking all of the viewable points, including the rotation origin and some of the additional points mentioned above, and computing the distance in the z-direction between the origin and each of the points in the image, the positions of the points in the z-direction can 19.

(32) be computed as a constant multiple of d, too. For those points which are not viewable, their positions can also be assigned based on the symmetry of the human face. After some adjustments and experiments, the values of the points in the z-direction adopted in this study are listed in Table 2.1. (a). (b). Figure 2.7 Two orthogonal photographs. (a) Front view. (b) Side view.
Table 2.1 The values of the points in the z-direction.
Hair: all hair points: -0.58d.
Eyes: 3.2, 3.4, 3.1, 3.3, EyeballLeft, EyeballRight: 1.33d; 3.12, 3.7: 1.19d; 3.8, 3.11: 1.25d.
Eyebrows: 4.6, 4.5: 1.17d; 4.4a, 4.3a, F, I: 1.30d; 4.4, 4.3, G, H: 1.44d; 4.2, 4.1: 1.49d.
Mouth: 2.2, 2.3: 1.55d; 8.4, 8.3, R, S: 1.40d; 8.1, 8.9, 8.10: 1.60d; 8.2: 1.56d.
Jaw: 2.14, 2.13: 0.36d; 2.1: 1.33d.
Forehead: 11.1: 1.37d; 11.2a, 11.3a: 0.90d; 11.2, 11.3: 0.42d; N, Q: 0.85d.
Ears: 10.2, 10.1: -0.37d; 10.8, 10.7: 0d; A, B: 0.04d.
Nose: 9.6, 9.7, 9.2, 9.1: 1.39d; 9.14, 9.13: 1.40d; 9.4, 9.5: 1.44d; 9.15: 1.65d; C: 1.51d; D, E: 1.53d.
Cheek: 10.9, 10.10: 0.27d; J, K: 1.18d; L, M: 1.11d.
20.
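The following minimal sketch summarizes the construction just described: moving the model points into the local coordinate system of Algorithm 2.1, attaching z-values from the generic model of Table 2.1, and rescaling a model by the ratio c’/c. Python and the dictionary-based layout are used for illustration only.

```python
def to_local(points, eye_mid, d):
    # Algorithm 2.1: the rotation origin O approximates the neck centre, and
    # the local y-axis points upward (image y grows downward).
    ox, oy = eye_mid[0], eye_mid[1] + 1.3 * d
    local = {name: (x - ox, oy - y) for name, (x, y) in points.items()}
    return (ox, oy), local

def assign_depth(local_points, d, z_table):
    # Attach each point's z-value as a multiple of the eyeball distance d,
    # following the generic model of Table 2.1,
    # e.g. z_table = {"9.15": 1.65, "2.1": 1.33, ...}.
    return {name: (x, y, z_table.get(name, 0.0) * d)
            for name, (x, y) in local_points.items()}

def rescale(points3d, c_old, c_new):
    # Normalize one model to another scale: multiply every coordinate
    # (and, not shown here, every FAPU) by the factor c'/c.
    s = c_new / c_old
    return {name: (s * x, s * y, s * z) for name, (x, y, z) in points3d.items()}
```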

(33) After the two steps are done, a 3D face model is constructed. We consider d as a length unit in the face model, and we can easily change the scale and the position of a 3D face model by changing its origin and the reference d value. The scheme is useful for normalization between different faces in different scales and positions. For example, if there is a face model whose d value is a certain constant c, and we want to scale it to a larger size with the value d being another constant c’, we can just apply the geometric ratio principle to multiply the position of each point and each FAPU by a factor of c’/c. 2.4 Creation of Cartoon Face As mentioned in Section 2.2, the cartoon face is created by the corner-cutting subdivision and the cubic Bezier curve approximation methods. In this section, two types of cartoon faces are introduced, one being the frontal cartoon face and the other the oblique cartoon face. It is hoped that by the two types of cartoon faces, a head-turning talking cartoon face can be represented smoothly. 2.4.1 Creation of Frontal Cartoon Face A frontal cartoon face is drawn by the 72 feature points and some of the additional points mentioned previously. Let O(xo, yo) denote the position of the rotation origin in the face model. Before the creation process, for each of the additional points and the 72 model points P(xp, yp), the position of P must be transformed into the global coordinate system in the following way: xp = xp + xo; yp = yo − yp. 21.

(34) After the transformation, the cartoon face can be drawn in the global coordinate system. The detail of the proposed frontal face creation method is described in the following as an algorithm.. Algorithm 2.2. Creation of frontal cartoon face. Input: 72 feature points, 17 additional points, and some FAPUs, including the radii of the eyeballs r1 and r2 in the face model.. Output: an image of the frontal cartoon face. Steps: 1. Let arc(P1, …, Pn) denote a curve composed by the points P1, …, Pn.. Figure 2.8 An illustration of arc(P1, …, Pn).. 2. Draw the contour of the hair by a polygon composed by 23 hair feature points. 3. Draw the contour of the face, including the forehead, cheek, and jaw, by the cubic Bezier curves arc(11.2, 11.2a, 11.1, 11.3a, 11.3), arc(11.3, 10.9, 2.13), arc(2.13, 2.1, 2.14), and arc(2.14, 10.10, 11.2). 4. Draw the contour of the left ear by the cubic Bezier curves arc(10.8, 10.2, A). 5. Draw the contour of the right ear in a similar way. 6. Draw the contour of the nose by the cubic Bezier curves arc(9.6, C, 9.14), arc(9.14, 9.2, 9.4), arc(9.13, 9.1, 9.5), and arc(D, 9.15, E). 7. Draw the contour of the left eyebrow by the corner-cutting subdivision curves arc(4.6, 4.4a, 4.4, 4.2) and arc(4.2, G, F, 4.6). 8. Draw the contour of the right eyebrow in a similar way.. 22.

(35) 9. Draw the contour of the left eye by the cubic Bezier curves arc(3.12, 3.2, 3.8), arc(3.8, 3.4, 3.12). 10. Draw the contour of the right eye in a similar way. 11. Draw a circle with the radius r1 and the center at EyeballLeft representative of the left eyeball. 12. Draw a circle of the right eyeball in a similar way. 13. Draw the contour of the mouth by the cubic Bezier curves arc(8.1, 8.9, 8.4), arc(8.4, 8.2, 8.3), arc(8.3, 8.10, 8.1), arc(R, 2.2, S), and arc(S, 2.3, R). 14. Fill the predefined colors into their corresponding parts. An illustration of the steps in the creation of the frontal cartoon face is shown in Figure 2.9. An experimental result of the creation of a frontal cartoon face is shown in Figure 2.10.. (a). (b). (c). (d). Figure 2.9 An illustration of the steps in the creation of the frontal cartoon face. (a) The creation of the contour of a face. (b) The creation of the ears. (c) The creation of the nose. (d) The creation of the eyebrows. (e) The creation of the eyes. (f) The creation of the mouth. 23.

(36) (e). (f). Figure 2.9 An illustration of the steps in the creation of the frontal cartoon face. (a) The creation of the contour of a face. (b) The creation of the ears. (c) The creation of the nose. (d) The creation of the eyebrows. (e) The creation of the eyes. (f) The creation of the mouth. (continued). (a). (b). Figure 2.10 An experimental result of the creation of a frontal cartoon face. (a) A male face model. (b) A female face model.. 2.4.2 Generation of Basic Facial Expressions After a frontal cartoon face is created, we are concerned about how to generate some basic facial expressions to make the face livelier. Facial Action Coding System (FACS) [14] defines some basic Facial Action Units (FAUs), which represents primary movements of facial muscles in actions such as raising eyebrows, blinking, talking, etc. The FACS has been useful for describing most important facial actions, so some of the FAUs defined in it are considered to be suitable in the study for 24.

(37) synthesis of facial expressions. For example, the FAU 12, whose description is lip corner puller, can be viewed as a smile. And the FAUs 1 and 2, which respectively represent the inner and outer eyebrow raisings, are the basic facial expressions that frequently happen when one is making a speech. By taking the FAUs as references, we decide to define three basic facial expressions: eye blinking, smiling, and eyebrow raising. For eye blinking, by changing the value of the FAPU LeftEyeH and RightEyeH, and setting up the positions of four model eye points 3.2, 3.4, 3.1, and 3.3 according to these two FAPUs, we can easily generate an eye blinking effect. An experimental result is shown in Figure 2.11.. Figure 2.11 An experimental result of generation of an eye blinking effect.. Similarly, by changing the positions of two model mouth points 8.4 and 8.3 according to the FAPU UpperLipH, a smiling effect can be created. In the meanwhile, by modifying the positions of two model eye points 3.4 and 3.3 based on the FAPUs LeftEyeH and RightEyeH, a squinting effect can be combined into the cartoon face to 25.

(38) make the smiling more vivid. An experimental result is shown in Figure 2.12.. Figure 2.12 An experimental result of generation of a smiling effect.. For eyebrow raising, all of the 8 model eyebrow points and 4 additional eyebrow points are involved. By regulating the positions of these points according to the FAPU EyebrowH, an eyebrow raising effect can be generated. An experimental result is shown in Figure 2.13.. Figure 2.13 An experimental result of generation of an eyebrow raising effect.. 2.4.3 Creation of Oblique Cartoon Face The basic idea of creation of an oblique cartoon face is to rotate the 3D face model on the three Cartesian axes in the local coordinate system. After rotation, the 26.

(39) 2.4.3 Creation of Oblique Cartoon Face The basic idea of the creation of an oblique cartoon face is to rotate the 3D face model about the three Cartesian axes in the local coordinate system. After rotation, the 3D points are projected onto the X-Y plane and transformed into the global coordinate system. Then the cartoon face can be illustrated by the previously-mentioned corner-cutting subdivision and cubic Bezier curve approximation methods. In this section, a review of a 3D rotation technique is presented in Section 2.4.3.1. A simulation of eyeballs gazing at a fixed target while the head is turning is described in Section 2.4.3.2. Finally, the creation process, including methods to solve additional problems that arise during drawing, is described in Section 2.4.3.3. 2.4.3.1. Review of 3D Rotation Technique. Suppose that a point in a 3D space, denoted by (x, y, z), is rotated about the three Cartesian axes respectively, as shown in Figure 2.14. Figure 2.14 An illustration of a point rotated on the three Cartesian axes. We define positive angles to represent counter-clockwise rotations and negative ones to represent clockwise rotations. The following two trigonometric identities are used to derive the 3D rotation formulas: sin(θ + β) = sin θ × cos β + cos θ × sin β; cos(θ + β) = cos θ × cos β − sin θ × sin β. Suppose the point is first rotated about the Y-axis, so the y coordinate will not be changed. It is assumed that after projecting the point onto the X-Z plane, the distance 27.

(40) between the point and the origin is L, as shown in Figure 2.15. Figure 2.15 An illustration of a point rotated on the Y axis. Then the identities above can be rewritten as follows:

x1/L = (x/L) × cos β + (z/L) × sin β;
z1/L = (z/L) × cos β − (x/L) × sin β.

After canceling L, the formula for the point rotated about the Y-axis is derived to be as follows:

x1 = x × cos β + z × sin β;
y1 = y;
z1 = −x × sin β + z × cos β.

Similarly, the formulas for the point rotated about the X- and Z-axes are derived as follows, respectively:

x2 = x1;
y2 = y1 × cos α − z1 × sin α;
z2 = y1 × sin α + z1 × cos α.

x3 = x2 × cos γ − y2 × sin γ;
y3 = x2 × sin γ + y2 × cos γ;
z3 = z2.

Finally, projecting the point (x3, y3, z3) onto the X-Y plane, we can get the new position of the point after the rotation is performed. 28.
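To make the rotation step concrete, the sketch below implements the three formulas above and the final projection onto the X-Y plane. It is a minimal illustration in Python; the rotation order (Y, then X, then Z) and the sign conventions follow the derivation above, while the sample point and angles are arbitrary.

import math

def rotate_and_project(x, y, z, alpha, beta, gamma):
    """Rotate a point about the Y-, X-, and Z-axes (in that order) and
    project the result onto the X-Y plane. Angles are in degrees;
    positive angles denote counter-clockwise rotations."""
    a, b, g = map(math.radians, (alpha, beta, gamma))
    # Rotation about the Y-axis by beta.
    x1 = x * math.cos(b) + z * math.sin(b)
    y1 = y
    z1 = -x * math.sin(b) + z * math.cos(b)
    # Rotation about the X-axis by alpha.
    x2 = x1
    y2 = y1 * math.cos(a) - z1 * math.sin(a)
    z2 = y1 * math.sin(a) + z1 * math.cos(a)
    # Rotation about the Z-axis by gamma.
    x3 = x2 * math.cos(g) - y2 * math.sin(g)
    y3 = x2 * math.sin(g) + y2 * math.cos(g)
    # Orthographic projection onto the X-Y plane simply drops the z coordinate.
    return x3, y3

# Example: a point in front of the face turned 15 degrees to the side.
print(rotate_and_project(0.0, 0.0, 1.0, alpha=0.0, beta=15.0, gamma=0.0))

Applying this function to every feature point and additional point of the model yields the 2D positions used for drawing the oblique face.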

(41) 2.4.3.2. Simulation of Eyeballs Gazing at a Fixed Target. The basic idea of simulating the eyeballs gazing at a fixed target is to set up a point representative of the focus of the eyes in the local coordinate system of the face model. By assuming a radius for the eyeball, the position of the eyeball center can be computed from the positions of the pupil and the focus. Then, for every rotation performed in the creation process, the new position of the eyeball center is also calculated, and the new position of the pupil can be computed from the positions of the eyeball center and the focus. In this study, the assumed radius of the eyeball is set to be 0.3d, and the position of the focus is (EyeMid.x, EyeMid.y, 15d). An illustration of the focus and eyeballs is shown in Figure 2.16. (a). (b). Figure 2.16 An illustration of the focus and eyeballs. (a) Before rotation. (b) After rotation. 29.
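A minimal sketch of this gaze computation is given below. It assumes the pupil is the point on the eyeball surface that lies on the line from the eyeball center toward the focus; the radius 0.3d and the focus position follow the description above, while d, EyeMid, and the sample eyeball center are illustrative values only.

import math

def pupil_position(center, focus, radius):
    """Return the pupil position: the point on the eyeball surface along the
    line from the eyeball center toward the focus point."""
    vx, vy, vz = (focus[0] - center[0], focus[1] - center[1], focus[2] - center[2])
    length = math.sqrt(vx * vx + vy * vy + vz * vz)
    return (center[0] + radius * vx / length,
            center[1] + radius * vy / length,
            center[2] + radius * vz / length)

# Illustrative values: d is a face-model unit, EyeMid the midpoint between the eyes.
d = 10.0
eye_mid = (100.0, 80.0, 0.0)
focus = (eye_mid[0], eye_mid[1], 15 * d)                      # focus in front of the face
left_eye_center = (eye_mid[0] - 2 * d, eye_mid[1], -0.3 * d)  # assumed eyeball center
print(pupil_position(left_eye_center, focus, radius=0.3 * d))

Before any rotation, the eyeball center itself can be obtained the other way around, by stepping back from the pupil along the same line by the assumed radius.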

(42) 2.4.3.3. Creation Process. An oblique cartoon face is drawn by the 72 feature points and some of the additional points mentioned previously. The creation process is similar to that of the frontal cartoon face, but it requires some additional steps, including the rotation step. Furthermore, some problems arise after applying the rotation technique. One problem is that the face contour becomes deformed, because some of the face contour points are hidden and no longer visible after the head is turned, so they can no longer represent the face contour. Therefore, we must use some other points instead of them. Another problem is that the hair contour points are defined on a flat plane in depth, which looks unrealistic after the rotation, as shown in Figure 2.17. We propose a method to solve these problems, which is to change the depth of some of these points before the rotation according to the rotation direction and angle. The details of the proposed oblique face creation method are described in the following algorithm. Figure 2.17 An illustration of the unrealistic appearance of the hair contour. Algorithm 2.3. Creation of oblique cartoon face. Input: 72 feature points, 17 additional points, a rotation origin O(xo, yo) in the global coordinate system, some FAPUs, including the radii of the eyeballs r1 and r2 in the face model, and 3 rotation angles α, β, and γ in degrees around the X-, Y-, and Z-axes. Output: an image of an oblique cartoon face. Steps: 30.

(43) 1. If β is larger than 0, for each of the hair point Phair(xph, yph, zph) in the right half of the face model where y ph ≥ EyeMid . y , add a constant multiple of d to zph according to the value of xph and β.. (a). (b). Figure 2.18 An illustration of the shift of hair contour points. (a) Before rotation. (b) After rotation.. 2. If β is smaller than 0, shift the hair points in a similar way. 3. For each of the additional points and the 72 model points P(xp, yp, zp), apply the rotation technique in Section 2.4.3.1 to get a new point P’(xp’, yp’) on the X-Y plane. 4. For each of the points P’(xp’, yp’), transform the position of P’ into the global coordinate system in the following way:. x p ' = x p ' + xo ,. y p ' = yo − y p ' .. 5. If β is larger than 10 degrees, replace the model points 11.3, 10.9, and 2.13 by Q, M, and K, respectively. 6. If β is smaller than -10 degrees, replace the model points 11.2, 10.10, and 2.14 by N, L, and J, respectively. 7. Apply Algorithm 2.2 to create the desired oblique cartoon face. 8. In Step 7, if β is larger than 0, draw the contour of the nose by the cubic Bezier curves arc(9.7, C, 9.13), arc(9.14, 9.2, 9.4), arc(9.13, 9.1, 9.5),. 31.

(44) and arc(D, 9.15, E). An illustration of the creation of oblique cartoon faces is shown in Figure 2.19. (a). (b). Figure 2.19 An illustration of creation of oblique cartoon faces. (a) An oblique cartoon face with β = 15 degrees. (b) An oblique cartoon face with β = −15 degrees.. 2.5 Experimental Results Some experimental results of creating cartoon faces in different poses and with different facial expressions are shown in Figure 2.20.. Figure 2.20 An example of experimental results for creation of cartoon faces in different poses with different facial expressions.. 32.

(45) Figure 2.20 An example of experimental results for creation of cartoon faces in different poses with different facial expressions. (continued). 33.
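The face contours shown in these results are all drawn with cubic Bezier segments via the arc(·) primitive used in the drawing algorithms of this chapter. As a concrete illustration, the sketch below samples such a segment; how the three arguments of arc(·) map to Bezier control points is defined earlier in this thesis, so the mapping used here (treating the middle argument as a point the curve passes through at its midpoint) is only an assumption for illustration.

def cubic_bezier(p0, c1, c2, p1, t):
    """Evaluate a cubic Bezier curve with control points p0, c1, c2, p1 at t in [0, 1]."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u**2 * t * c1[0] + 3 * u * t**2 * c2[0] + t**3 * p1[0]
    y = u**3 * p0[1] + 3 * u**2 * t * c1[1] + 3 * u * t**2 * c2[1] + t**3 * p1[1]
    return (x, y)

def arc_points(p0, m, p1, samples=16):
    """Sample a cubic Bezier segment from p0 to p1 that passes through m at
    t = 0.5 (an assumed interpretation of arc(p0, m, p1))."""
    c = ((8 * m[0] - p0[0] - p1[0]) / 6.0, (8 * m[1] - p0[1] - p1[1]) / 6.0)
    return [cubic_bezier(p0, c, c, p1, i / float(samples)) for i in range(samples + 1)]

# Example: an arc for an upper eyelid between two eye corner points.
print(arc_points((0.0, 0.0), (5.0, 2.0), (10.0, 0.0))[:3])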

(46) Chapter 3 Speech Segmentation for Lip Synchronization. 3.1 Introduction to Lip Synchronization for Talking Cartoon Faces The main purpose of this study is to establish a system which can be used to generate a speech-driven synchronized talking cartoon face. In Chapter 2, we have described how we construct a cartoon face model, which can be used to animate a moving face by changing the positions of its control points according to some of its FAPUs. The next issue is to control the lip movement by analyzing the speech track and gathering the timing information, namely, the duration, of each syllable in the speech. In the proposed system, one of the four major parts shown in Figure 1.1, which is named speech analyzer, is designed to achieve this goal. The speech analyzer receives a speech file and a script file, which is called a transcript in this study, and applies speech recognition techniques to get the timing information of each syllable. A flowchart of the proposed speech analyzer is shown in Figure 3.1. A transcript is usually composed of many sentences. Although it is feasible to directly get the timing information of each syllable from the speech of the entire transcript without segmentation of the sentence utterances, it will take too much time to do so if the input audio is long. Therefore, by segmenting the entire audio into sentence utterances as the first step and then processing each segmented shorter. 34.

(47) sentence utterance piece sequentially to extract the duration of each syllable in the speech, the overall processing speed can be accelerated. Some audio features mentioned above are listed in Table 3.1.

Figure 3.1 A flowchart of the proposed speech analysis process.

Table 3.1 Descriptions of audio features.

Feature: Speech of Transcript
Description: A speech that contains the audio data of the entire transcript, including many sentences.
Example: 或許你已聽過,多補充抗氧化劑可以延緩老化。但真相為何?

Feature: Speech of Sentence Utterance
Description: A speech that contains the audio data of a single sentence utterance, including several syllables.
Example: 或許你已聽過。

Feature: Speech of Syllable
Description: A speech that contains the audio data of a single syllable.
Example: ㄏㄨㄛ、ㄒㄩ、ㄋㄧ、ㄧ、ㄊㄧㄥ、ㄍㄨㄛ

In Section 3.2, a method for segmentation of sentence utterances is proposed. In Section 3.3, the process of Mandarin syllable segmentation is described. 35.

(48) 3.2 Segmentation of Sentence Utterances by Silence Feature 3.2.1 Review of Adopted Segmentation Method In Lai and Tsai [4], a video recording process was designed to extract the necessary feature information from a human model to generate a virtual face animation. In the recording process, the model should keep his/her head facing straight toward the camera, shake his/her head slightly for a predefined period of time while keeping silent, and read aloud the sentences on the transcript one after another, each followed by a predefined period of silent pause. An example of the recorded video contents and the corresponding actions is shown in Figure 3.2. Figure 3.2 An example of recorded video contents and corresponding actions in Lai and Tsai [4]. The model keeps silent for predefined periods of time in two situations, as we can see in Figure 3.2. One is the initial silence, where the model keeps silent while shaking his/her head, and the other is the intermediate silence, where the model keeps silent in the pauses between sentences. In order to segment the speech into sentence utterances automatically, the silence features are used in their method to detect the 36.

(49) silence parts between sentences and perform automatic segmentation of sentence utterances based on these detected silence parts. Due to the environment noise, the volume of these silence parts usually is not zero. Therefore, the positions of silences cannot be detected by simply searching zero-volume zones. So in their method, the maximum volume of the environment noise is measured first in the initial silence period and is then used as a threshold value to determine the intermediate silence parts by searching for the audio parts whose volumes are smaller than the threshold value. Lai and Tsai also defined another threshold to be representative of the minimum length of the intermediate silence. Because short pauses between syllables in a sentence may be viewed as silences, the minimum duration of pauses between sentences, i.e. the minimum length of intermediate silences, should be designed to be much longer than the duration between syllables to avoid incorrect detections. In such ways, the intermediate silence parts can be found, and the sentence utterances can be segmented, to a rather high degree of precision.. 3.2.2 Segmentation Process In Lai and Tsai’s method, speech can be segmented into sentence utterances by silence features. However, the process of measuring the environment noise is not suitable for real cases because the duration of the initial silence period is unknown. Even in some cases, there is no initial silence period before the speaker starts to talk. Hence, the maximum volume of the environment noise cannot always be measured by adopting their method. Moreover, the duration of pauses between sentences is different for different speakers. So the silence features, including the maximum volume of the environment noise and the duration of pauses between sentences, must be learned in another way. In the proposed system, an interface is designed to let users select the pause 37.

(50) between the first sentence and the second one from the input audio. Then the silence features can be learned according to the selected part of audio. A flowchart of the proposed sentence segmentation process is shown in Figure 3.3. The entire process of sentence utterance segmentation is described as an algorithm in the following.. Figure 3.3 A flowchart of the sentence utterance segmentation process.. Algorithm 3.1. Segmentation of sentence utterances. Input: A speech file Stranscript of the entire transcript. Output: Several audio parts of sentence utterances Ssentence1, Ssentence2, etc. Steps: 1. Select the start time ts and the end time te of the first intermediate silence in Stranscript by hand. 2. Find the maximum volume V appearing in the audio part within the selected first intermediate silence.. 38.

(51) 3. Set the minimum duration of the intermediate silence Dpause as (te − ts) × c1, where c1 is a constant between 0.5 and 1. 4. Set the maximum volume of environment noise Vnoise as V × c2, where c2 is a constant between 1 and 1.5. 5. Start from ts to find all continuous audio parts Ssilence whose volumes are smaller than Vnoise and which last longer than Dpause. 6. Find a continuous audio part Ssentence, called a speaking part, which is not occupied by any Ssilence. 7. Repeat Step 6 until all speaking parts are extracted. 8. Break Stranscript into the audio parts of the speaking parts found in Step 7. Since we assume that the speech is spoken at a steady speed, the durations of the other intermediate silences are considered to be close to the first one. Therefore, c1 in Step 3 is chosen to be 0.95. Furthermore, since we assume that the speech is spoken in a loud voice and that the recording environment of the input audio is noiseless, the volume of the speaking parts is considered to be much larger than that of the environment noise. To avoid missing silent parts, c2 in Step 4 is chosen to be the larger value 1.45. An example of selecting the first silent part in an input audio is shown in Figure 3.4. The red part represents the selected silence period between the first sentence and the second one. An example of experimental results of the proposed segmentation algorithm is shown in Figure 3.5. The blue and green parts represent odd and even sentences, respectively. As we can see, the speaking parts of the input audio are extracted correctly. 39.
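A minimal sketch of Algorithm 3.1 in Python is shown below. It operates on a list of per-frame volume values (for example, RMS energy computed over fixed-length frames); the frame length, the volume measure, and the variable names are illustrative assumptions, while the thresholds follow Steps 3 and 4 above.

def segment_sentences(volumes, frame_sec, t_s, t_e, c1=0.95, c2=1.45):
    """Split a speech track into speaking parts by silence features.
    volumes:   per-frame volume values (e.g., RMS energy per frame).
    frame_sec: duration of one frame in seconds.
    t_s, t_e:  start and end times (seconds) of the manually selected
               first intermediate silence.
    Returns a list of (start_sec, end_sec) pairs for the speaking parts."""
    s, e = int(t_s / frame_sec), int(t_e / frame_sec)
    v_noise = max(volumes[s:e]) * c2                    # Steps 2 and 4
    min_pause = int(((t_e - t_s) * c1) / frame_sec)     # Step 3, in frames

    # Step 5: mark frames that belong to a sufficiently long silent run.
    is_long_silence = [False] * len(volumes)
    i = 0
    while i < len(volumes):
        if volumes[i] < v_noise:
            j = i
            while j < len(volumes) and volumes[j] < v_noise:
                j += 1
            if j - i >= min_pause:
                for k in range(i, j):
                    is_long_silence[k] = True
            i = j
        else:
            i += 1

    # Steps 6 to 8: every maximal run of frames not covered by a long silence
    # is a speaking part.
    parts, start = [], None
    for idx, silent in enumerate(is_long_silence):
        if not silent and start is None:
            start = idx
        elif silent and start is not None:
            parts.append((start * frame_sec, idx * frame_sec))
            start = None
    if start is not None:
        parts.append((start * frame_sec, len(volumes) * frame_sec))
    return parts

With, say, 10-millisecond frames (frame_sec = 0.01) and per-frame RMS volumes, the returned (start, end) pairs give the speaking parts in seconds.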

(52) Figure 3.4 An example of selecting the first silent part in an input audio. Figure 3.5 An example of sentence utterance segmentation results. The blue and green parts represent odd and even speaking parts, respectively. 3.3 Mandarin Syllable Segmentation 3.3.1 Review of Adopted Method After the speech of the entire transcript is segmented into sentence utterances, the timing information of each syllable in a sentence can be extracted by speech recognition techniques. One of the speech recognition techniques, called speech alignment, produces recognition results with higher accuracy because the syllables 40.

(53) spoken in the input speech are known in advance. In this study, a speech alignment technique using the Hidden Markov Model (HMM) is adopted to extract the timing information of syllables. The HMM has been widely used for speech recognition. It is a kind of statistical model which is useful for characterizing the spectral properties of the frames of a speech pattern. In Lin and Tsai [3], a sub-syllable model was adopted together with the HMM for the recognition of Mandarin syllables. After the construction of the sub-syllable model, the Viterbi search is used to segment the utterance. Then the timing information of each syllable in the input audio can be extracted. 3.3.2 Segmentation Process In the proposed system, an entire transcript is segmented into sentences by punctuation marks. For each sentence, the Mandarin characters are transformed into their corresponding syllables according to a pre-constructed database. If a character has multiple pronunciations, its correct syllable is selected by hand. Then each sentence utterance is aligned with its corresponding syllables. Finally, the timing information for each sentence utterance is combined into a global timeline; a sketch of this last step is given after the experimental results below. A flowchart of the Mandarin syllable segmentation process is shown in Figure 3.6. 3.4 Experimental Results Some experimental results of applying the proposed method for extracting the timing information of Mandarin syllables in the speech of an entire transcript are shown here. Two examples of entire audios of a transcript are shown in Figure 3.7 and Figure 3.9, and their corresponding results of syllable alignment are shown in Figure 41.

(54) 3.8 and Figure 3.10. Durations of syllables are shown in blue and green colors.. Figure 3.6 A flowchart of the Mandarin syllable segmentation process.. Figure 3.7 An example of entire audio data of a transcript. The content of the transcript is “或許你已聽 過,多補充抗氧化劑可以延緩老化。但真相為何?”.. Figure 3.8 The result of syllable alignment of the audio in Figure 3.7. 42.

(55) Figure 3.9 An example of entire audio data of a transcript. The content of the transcript is “長期下來, 傷害不斷累積,就可能造就出一個較老、較脆弱的身體。”.. Figure 3.10 The result of syllable alignment of the audio in Figure 3.9.. 43.
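As noted in Section 3.3.2, the per-sentence syllable durations are finally combined into a single global timeline. A minimal sketch of this bookkeeping step is given below; the data layout (a list of speaking parts, each with per-syllable start and end times relative to its own sentence utterance) is an illustrative assumption.

def combine_timelines(sentence_parts):
    """Merge per-sentence syllable timings into one global timeline.
    sentence_parts: list of (sentence_start_sec, syllables), where syllables is
                    a list of (syllable, start_sec, end_sec) relative to the
                    start of that sentence utterance.
    Returns a list of (syllable, global_start_sec, global_end_sec)."""
    timeline = []
    for sentence_start, syllables in sentence_parts:
        for syllable, start, end in syllables:
            timeline.append((syllable, sentence_start + start, sentence_start + end))
    return timeline

# Illustrative example with two short sentence utterances.
parts = [
    (0.0, [("huo", 0.00, 0.21), ("xu", 0.21, 0.45)]),
    (3.2, [("dan", 0.00, 0.18), ("zhen", 0.18, 0.40)]),
]
print(combine_timelines(parts))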
