
Chapter 1 Introduction

1.3 Overview of Proposed Method

This section gives an overview of the proposed approach. Section 1.3.1 defines the terms used in this study, Section 1.3.2 lists the assumptions made, and Section 1.3.3 briefly outlines the proposed method.

1.3.1 Definitions of Terms

The terms used in this study are defined as follows.

(1) Neutral Face: MPEG-4 specifies the following conditions for a head in its neutral state [13].

1. Gaze is in the direction of the Z-axis.

2. All face muscles are relaxed.

3. Eyelids are tangent to the iris.

4. The pupil is one third of the iris diameter.

5. The lips are in contact.

6. The line of the lips is horizontal and at the same height as the lip corners.

7. The mouth is closed and the upper teeth touch the lower ones.

8. The tongue is flat and horizontal with the tip of the tongue touching the boundary between the upper and lower teeth.

In this thesis, a face with a normal expression is called a neutral face.

(2) Neutral Facial Image: A neutral facial image is an image with a frontal and straight neutral face in it.

(3) Facial Features: The proposed system considers several features of each facial image: hair, face, eyebrows, eyes, nose, mouth, and ears.

(4) Facial Action Units (FAUs): The Facial Action Coding System (FACS) [14] defines 66 basic Facial Action Units (FAUs). Most FAUs represent primary movements of facial muscles in actions such as raising the eyebrows, blinking, and talking. The other FAUs represent head and eye movements.

(5) Facial Expression: A facial expression is a facial aspect representative of feeling.

Here, facial expressions include both emotions and lip movements, and they can be described as combinations of FAUs.

(6) FAPUs: Facial animation parameter units (FAPUs) are fractions of the distances between certain facial features, such as the eye separation and the mouth width.

(7) Face Model: A face model is a 2D cartoon face model with 74 feature points and some FAPUs. It includes hair, eyebrows, eyes, a nose, a mouth, and so on.

(8) Face Model Control Points: Face model control points are a subset of the 74 feature points of the face model. They are used to control actions of the face model such as eyebrow raising, eye opening, lip movement, head tilting, and head turning.

(9) Phoneme: A phoneme is a basic unit of sound in a language. For example, ㄉ, ㄨ, and ㄞ are phonemes in Mandarin.

(10) Syllable: A syllable consists of one or more phonemes. For example, ㄊㄢ and ㄍㄣ are syllables in Mandarin.

(11) Viseme: A viseme is the visual counterpart of a phoneme.

(12) Speech Analyzer: A speech analyzer receives a speech file and a script file as input, and applies speech recognition techniques to get the timing information of each syllable.

(13) Key Frame: A key frame is a frame on the animation timeline at which the appearance of the cartoon face can be specified exactly.

(14) Hidden Markov Model (HMM): The HMM is used to characterize the spectral properties of the frames of a speech pattern.

(15) Transcript: A transcript is a text file that contains the text content of the corresponding speech.
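To make definitions (6), (8), and (13) more concrete, the following is a minimal Python sketch rather than the implementation used in this study. It assumes the MPEG-4 convention of dividing a neutral-face distance by 1024 to obtain a FAPU, and it assumes simple linear interpolation of a control point between key frames. The function names `fapu` and `interpolate` and all sample coordinates are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two 2D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def fapu(a: Point, b: Point, fraction: float = 1024.0) -> float:
    """One FAPU: a fraction of the distance between two facial features
    measured on the neutral face (the divisor 1024 is an assumption
    taken from the MPEG-4 convention)."""
    return distance(a, b) / fraction


def interpolate(key_frames: Dict[int, Point], frame: int) -> Point:
    """Position of one control point at `frame`, linearly interpolated
    between the two surrounding key frames (assumed scheme)."""
    frames = sorted(key_frames)
    # Outside the key-framed range, hold the nearest key frame.
    if frame <= frames[0]:
        return key_frames[frames[0]]
    if frame >= frames[-1]:
        return key_frames[frames[-1]]
    for lo, hi in zip(frames, frames[1:]):
        if lo <= frame <= hi:
            t = (frame - lo) / (hi - lo)
            p, q = key_frames[lo], key_frames[hi]
            return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)
    raise ValueError("unreachable: frame lies inside the key-frame range")


# Hypothetical neutral-face eye centers, 64 pixels apart.
eye_separation_unit = fapu((100.0, 120.0), (164.0, 120.0))

# A hypothetical eyebrow control point raised by 10 pixels over 10 frames.
eyebrow_keys = {0: (100.0, 100.0), 10: (100.0, 90.0)}
halfway = interpolate(eyebrow_keys, 5)
```

Expressing displacements in FAPUs rather than in pixels lets the same animation values be reused on face models of different sizes, which is the purpose of normalizing by neutral-face distances.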

1.3.2 Assumptions

In real situations, it is not easy to simulate a 3D rotation without depth information. Moreover, because the proposed system takes only 2D cartoon face models and audio data as input, unsuitable properties of these data may degrade the result. For example, noise in the audio may affect the result of syllable segmentation, and an unusual distribution of the facial features may cause exceptions in the 3D coordinate generation process. To reduce the complexity of the processing in the proposed system, a few assumptions and restrictions are made in this study. They are described as follows.

(1) The recording environment of the input audio is noiseless.

(2) The speech is spoken at a steady speed and in a loud voice.

(3) The face of the model always faces the camera, and the rotation angle of the face about each of the three Cartesian axes does not exceed ±15° during natural speech.

(4) The face of the model has smooth facial features.
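Assumption (3) can be read as a simple range check on the head pose. The sketch below is illustrative only; the function name `within_rotation_limit` and the representation of the pose as three angles in degrees are assumptions, not part of the proposed system.

```python
MAX_ROTATION_DEG = 15.0  # limit stated in assumption (3)


def within_rotation_limit(rx: float, ry: float, rz: float,
                          limit: float = MAX_ROTATION_DEG) -> bool:
    """Return True if the rotation about each Cartesian axis (in degrees)
    stays within +/- `limit`."""
    return all(abs(angle) <= limit for angle in (rx, ry, rz))
```

A pose of (10°, -15°, 0°) satisfies the assumption, whereas a 16° rotation about any single axis violates it.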

1.3.3 Brief Descriptions of Proposed Method

The proposed system consists of four major parts: a cartoon face creator, a speech analyzer, an animation editor, and an animation and webpage generator. The cartoon face creator creates cartoon faces from neutral facial images or cartoon face models. The speech analyzer segments the speech file into sentences and then performs speech-text alignment for lip synchronization. The animation editor allows users to edit facial actions, such as head movements and eyebrow raises, in the animation. The animation and webpage generator renders cartoon faces and generates webpages with the animation embedded. The configuration of the proposed system is shown in Figure 1.1.

Figure 1.1 Configuration of proposed system.

We use a web camera to capture a single neutral facial image, and then utilize the proposed cartoon face creator to create a personal cartoon face. The positions of the feature points and the values of some FAPUs can be saved as a face model, which the cartoon face creator can also load as input to create a personal cartoon face.

After the personal cartoon face is created, users may supply a speech file and a script file to the speech analyzer, which obtains the timing information of each syllable in the speech file. The animation editor then automatically synthesizes lip movements according to this timing information. Users may specify facial actions themselves or let the animation editor generate them automatically. Finally, the animation and webpage generator outputs an animation file of a talking cartoon face and a webpage file with the animation embedded in it.
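The step from syllable timing information to lip movements can be illustrated with a small Python sketch. It is a placeholder under stated assumptions, not the synthesis method of this study: each syllable is assumed to arrive as a hypothetical `(label, start_sec, end_sec)` tuple, and the mouth is simply closed at the start, fully open at the midpoint, and closed again at the end of each syllable. The function name `lip_key_frames`, the frame rate, and the sample timings are all hypothetical.

```python
from typing import Dict, List, Tuple

Syllable = Tuple[str, float, float]  # (label, start_sec, end_sec), assumed format


def lip_key_frames(syllables: List[Syllable],
                   fps: float = 30.0,
                   open_amount: float = 1.0) -> List[Tuple[int, float]]:
    """Turn syllable timings into (frame_index, mouth_opening) key frames.

    Placeholder pattern: closed at the start, open at the midpoint, and
    closed at the end of every syllable.
    """
    keys: Dict[int, float] = {}
    for _label, start, end in syllables:
        mid = (start + end) / 2.0
        for time, opening in ((start, 0.0), (mid, open_amount), (end, 0.0)):
            frame = round(time * fps)
            # If two key frames land on the same frame, keep the wider opening.
            keys[frame] = max(keys.get(frame, 0.0), opening)
    return sorted(keys.items())


# Hypothetical timings for two syllables.
timings = [("ㄊㄢ", 0.0, 0.2), ("ㄍㄣ", 0.2, 0.6)]
keys = lip_key_frames(timings)
```

A fuller system would choose a mouth shape per viseme instead of a single opening value; the sketch only shows how timing information fixes where the key frames fall on the timeline.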