
Chapter 1 Introduction

1.5 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 gives an overview of the proposed system and its processes. Chapter 3 presents the proposed methods for learning audio features automatically. Chapter 4 presents the proposed methods for locating base regions automatically and precisely. Chapter 5 describes the proposed methods for locating other facial features automatically. Chapter 6 presents the proposed methods for generating smooth animations. In Chapter 7, some applications of the proposed methods are presented. Finally, conclusions and some suggestions for future research are included in Chapter 8. Experimental results and discussions are given in each chapter.

Chapter 2

System Overview

2.1 System Organization and Processes

As illustrated in Fig. 1.1, the proposed system consists of three main processes: video recording, feature learning, and animation generation. In this section, the relations among these processes are described.

The video recording process produces a video of a speaking model (a person). The process must be designed carefully so that the subsequent processes can gather enough information from the video for the creation of virtual talking faces. The process must also be simple and reasonable; otherwise, the model may feel uncomfortable. Fig. 2.1(a) illustrates the proposed video recording process.

The feature learning process produces facial feature information, using the output of the video recording process as its only input. To reduce manual intervention, several methods for the automatic extraction of different kinds of features are proposed. Fig. 2.1(b) illustrates the feature learning process.

The animation creation process produces animations containing virtual talking faces. Sound tracks of speech are used as inputs, which provide the timing information of the spoken syllables. Feature information from the feature learning process is also used as an input to help generate the final animations. The generated animations must be smooth and realistic; otherwise, viewers will not be satisfied. The animations may take different forms, such as virtual announcers, virtual singers, virtual teachers, etc. Fig. 2.1(c) illustrates the animation creation process.


Fig. 2.1 Flowcharts of three main processes. (a) Video recording process. (b) Feature learning process. (c) Animation generation process.

In the following sections, the contents of these processes are described in detail.

2.2 Video Recording Process

In the following sections, the details of the video recording process are described. In Section 2.2.1, the setup of the recording environment is discussed. In Section 2.2.2, a transcript for the learning of Mandarin syllables is proposed. In Section 2.2.3, the detailed recording process is described.

2.2.1 Environment Setup

The arrangement of the recording environment strongly affects the impression the model receives. Since the model may not be well acquainted with the system operator, the recording process should be as simple as possible, so that the model does not feel confused or impatient. An overly lengthy or complicated recording process is not acceptable.

A scene of our environment setup is shown in Fig. 2.2. The proposed recording environment setup is rather simple. The model is seated in front of a camera. A pre-designed transcript is shown on a screen right behind the camera, and its position is adjusted so that the model can read the transcript without obstruction. As shown in Chapter 8, the recording takes only about two minutes, which is quite short.


Fig. 2.2 A scene of the proposed environment setup. (a) The model. (b) The transcript. (c) The recorded scene.

In order to capture videos of better quality, some extra devices are adopted in our environment setup. Introducing these devices does not affect the simplicity of the process and adds only a little workload for the system operator. For example, instead of an ordinary webcam, a DV (digital video) device capable of grabbing 720×480 frames at a rate of 29.97 frames per second is used to record the videos. Two spotlights are also used. They not only brighten the model but also reduce the flickering effects of fluorescent lamps. Since the final animations are generated from the recorded frames, an environment with steadier lighting makes the resulting animations look smoother.

2.2.2 Transcript Reading

In this study, we aim to create virtual faces that are capable of speaking Chinese words. In [6], Lin and Tsai classified the 411 kinds of Mandarin syllables into 115 classes according to mouth-shape similarities. For virtual faces that can speak all Mandarin words, these 115 classes of Mandarin syllables must be learned. However, speaking these syllables one by one is boring work for the model. Therefore, we propose a transcript that covers the 115 classes of syllables in 17 sentences, as shown in Table 2.1. These sentences are designed to be meaningful and short, so that the model can speak them easily. Efforts are also made to minimize repetitions of syllables in the transcript.

Table 2.1 The proposed transcript that contains the 115 classes of Mandarin syllables.

Number  Sentence          Used syllable classes
1       好朋友自遠方來    35、63、84、2、108、51、23
...
12      老翁和阿婆喝茶    36、106、48、3、12、17、4
...

2.2.3 Recording Process

After the environment is set up and the transcript is prepared, the recording process can begin. For the convenience of feature learning, some extra work must be done during recording. However, this adds only a little workload for the system operator and the model, which is acceptable.

Firstly, the model should face straight toward the camera before the recording begins. Since the first frame of the recorded video is used as a reference frame in the feature learning process, a frontal face with a normal expression is required; otherwise, inferior information may be learned in the subsequent feature learning process.

Secondly, after the recording begins, the system operator should instruct the model to shake his/her head for a predefined period of time while keeping silent. The video recorded during this period is used to assist the learning of audio features and base image sequences, as described in the following sections.

Thirdly, the model is instructed to read aloud the sentences of the transcript one by one, each followed by a predefined period of silence. These pauses are used to help learn the audio features. The model should read the sentences loudly, clearly, and slowly, so that the syllables can be learned correctly.

A flowchart of the video recording process is illustrated in Fig. 2.3. A diagram of the content of the recorded video and the corresponding actions taken is shown in Fig. 2.4.

Fig. 2.3 A flowchart of the video recording process.


Fig. 2.4 A diagram showing the audio and images of the recorded video and the corresponding actions taken.

2.3 Feature Learning Process

After the video recording process is done, features can be learned from the recorded video. Section 2.3.1 lists the features required by the proposed system, and Section 2.3.2 illustrates the process of learning these different features.

2.3.1 Feature Classification

Features required for creation of virtual talking faces can be classified into four types: audio features, base image sequences, facial features, and base regions.

Audio features are the timing information of the spoken syllables in the recorded video. The timing information of the whole speech, of each sentence, and of each syllable are examples of audio features. These features are used to help synchronize audio and images.

Base image sequences are sequences of facial images that are used as background images. Mutable facial parts such as mouths can be pasted onto these images to form faces speaking different words. Base image sequences also control head movements such as shaking.

Facial features are distinctive parts of faces that can be used as natural marks. For example, noses, lips, and jaws are the facial features adopted in this study.

Base regions are special facial features that can be used to orient faces. With the help of base regions, the positions and orientations of faces can be calculated. In this study, noses are adopted as the base regions for locating faces.

2.3.2 Learning Process

Fig. 2.5 illustrates a flowchart of the feature learning process. First, the recorded video is split into audio data and image frames. With the help of the transcript, the audio features can be learned from the audio data. The facial features can be learned directly from the image frames. Learning the base image sequences requires information from both the audio data and the image frames.

Since these learning processes involve a large amount of audio data and image frames, manual processing is not acceptable. Several methods for learning these features automatically are proposed and explained in Chapters 3 and 4.


Fig. 2.5 A flowchart of the feature learning process.
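As a concrete illustration of the first step above, the following sketch shows one possible way to split a recorded video into an audio track and image frames. It is only a minimal example, assuming OpenCV and the ffmpeg command-line tool are available; the file name "recorded.avi" is hypothetical, and the thesis itself does not prescribe specific tools.

```python
# A minimal sketch (not the thesis implementation): split a recorded video
# into an audio track and individual image frames.
# Assumptions: OpenCV (cv2) and the ffmpeg CLI are installed; the input
# file name "recorded.avi" is hypothetical.
import subprocess
import cv2

VIDEO_PATH = "recorded.avi"

# Extract the audio track as an uncompressed WAV file with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO_PATH, "-vn",
     "-acodec", "pcm_s16le", "-ar", "44100", "audio.wav"],
    check=True,
)

# Read the image frames one by one with OpenCV.
frames = []
cap = cv2.VideoCapture(VIDEO_PATH)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)          # each frame is a 720x480 BGR image
cap.release()

print(f"extracted {len(frames)} frames and wrote audio.wav")
```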

2.4 Animation Generation Process

After the features have been collected, animations of virtual faces can be created. In Section 2.4.1, some essential properties of the animations are described. The animation generation process is illustrated in Section 2.4.2.

2.4.1 Properties of Animations

To create virtual faces that improve the interface between humans and computers, the created animations should possess several properties. Firstly, they should have realistic appearances; interacting with unrealistic faces feels strange and unnatural. Secondly, lip movements should be smooth. Since people are used to watching others' lip movements while talking to each other, unnatural lip movements are easily noticed. Thirdly, fluent speaking abilities are required. Fourthly, speech and lip movements should be synchronized.

Humans are sensitive to asynchrony between speech and lip movements.

The animations created by the proposed system possess all of these properties. Firstly, the animations are generated from 2D image frames; with good techniques for integrating facial parts, they look realistic and natural. Secondly, several methods are proposed to smooth the lip movements. Thirdly, original sound tracks of real people are adopted in the final animations, which avoids the problem of unnatural voices. Fourthly, a method is proposed to synchronize speech and lip movements.

2.4.2 Animation Generation Process

The animation generation process requires two inputs: a transcript and its corresponding speech data. First, a person is asked to read the transcript, and the speech is recorded. The syllable alignment process then extracts the timing information of the syllables in the speech. With the help of this timing information and the feature data, image frames that are synchronized with the speech can be generated. Finally, the animation is generated by composing the image frames and the speech data.

Fig. 2.6 illustrates a flowchart of the animation generation process.


Fig. 2.6 A flowchart of the animation generation process.
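To make the composition step more concrete, the sketch below shows one possible way to combine generated image frames with the recorded speech into a video file. It is a minimal illustration only, assuming OpenCV and the ffmpeg CLI; the file names, frame rate, and codec choices are hypothetical and not taken from the thesis.

```python
# A minimal sketch (not the thesis implementation): compose generated frames
# and a speech track into a final animation file.
# Assumptions: OpenCV (cv2), numpy, and the ffmpeg CLI are installed; the
# names "speech.wav" and "animation.mp4" are hypothetical.
import subprocess
import cv2
import numpy as np

def compose_animation(frames, speech_wav, out_path, fps=29.97):
    """Write frames to a silent video, then mux the speech track onto it."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("silent.avi",
                             cv2.VideoWriter_fourcc(*"MJPG"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()

    # Mux the audio with ffmpeg, re-encoding the video stream to H.264.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "silent.avi", "-i", speech_wav,
         "-c:v", "libx264", "-c:a", "aac", "-shortest", out_path],
        check=True,
    )

# Example usage with dummy black frames (720x480, BGR):
dummy = [np.zeros((480, 720, 3), dtype=np.uint8) for _ in range(30)]
# compose_animation(dummy, "speech.wav", "animation.mp4")
```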

Chapter 3

Automatic Learning of Audio Features and Base Image Sequences

3.1 Learning of Audio Features

In Section 2.3 of the last chapter, the four types of features that must be learned in the feature learning process were described. The following sections concentrate on the learning of audio features. In Section 3.1.1, the audio features used in this study are described in detail. In Section 3.1.2, a method for the segmentation of sentences is proposed. In Section 3.1.3, the process of syllable alignment is reviewed.

3.1.1 Descriptions of Audio Features

In the video recording process, a video of a model's face is recorded while the model reads the pre-designed transcript aloud. The speech contains the timing information, namely the duration, of every syllable, and this information must be learned. Without it, syllable labels cannot be assigned to image frames, which makes it impossible to know which syllable a given frame belongs to.

Since the pre-designed transcript is composed of the seventeen sentences designed in this study, the speech of every sentence must be learned first, before the syllables are learned. It is possible to learn the timing information of the syllables directly from the speech of the entire transcript without segmenting the sentences. However, the processing time grows considerably as the length of the input audio increases. By segmenting the sentences in advance, shorter audio parts are used in the learning process, which accelerates the processing.

The audio features mentioned above are listed in Table 3.1.

Table 3.1 Descriptions of audio features.

Feature                Description                                                    Example
Speech of Transcript   A speech that contains the audio data of the entire
                       transcript, including seventeen sentences.
Speech of Sentence     A speech that contains the audio data of a single sentence,   好朋友自遠方來。
                       including several syllables.
Speech of Syllable     A speech that contains the audio data of a single syllable.   ㄏㄠ、ㄆㄥ、ㄧㄡ、ㄗ、ㄩㄢ、ㄈㄤ、ㄌㄞ
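As an illustration of how these three levels of timing information might be represented in a program, the following sketch defines a simple nested data structure. The class and field names are hypothetical and are not part of the proposed system; they merely mirror the hierarchy in Table 3.1.

```python
# A minimal sketch (not the thesis implementation): one possible in-memory
# representation of the audio features listed in Table 3.1.
# All class and field names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SyllableSegment:
    label: str            # syllable class label, e.g. a Bopomofo string
    start: float          # start time in seconds within the recording
    end: float            # end time in seconds

@dataclass
class SentenceSegment:
    text: str             # the sentence from the transcript
    start: float
    end: float
    syllables: List[SyllableSegment] = field(default_factory=list)

@dataclass
class TranscriptSpeech:
    start: float          # start of the whole speech
    end: float
    sentences: List[SentenceSegment] = field(default_factory=list)
```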

3.1.2 Segmentation of Sentences by Silence Features

In the preceding section, the reason for sentence segmentation was explained. In the following sections, a method for sentence segmentation is proposed. A new kind of audio feature, called the "silence feature", which is used in the proposed method, is introduced first.

3.1.2.1 Idea of Segmentation

In order to segment the speeches of sentences automatically, the video recording process is designed to let the model keep silent for predefined periods of time in the two situations defined in the following:

(1) Initial silence: A period of time when the model keeps silent while shaking his/her head, such as the red part in Fig. 3.1.

(2) Intermediate silence: A period of time when the model keeps silent in the pauses between sentences, such as the blue parts in Fig. 3.1.

If these silence features can be learned, periods of silence can be detected, which means that the periods of the sentences can be detected as well. After that, segmenting the speeches of the sentences becomes an easy task. In the following section, a method for detecting the silence features is proposed, along with a sentence segmentation algorithm.

Fig. 3.1 A diagram showing the audio, the corresponding actions taken, and the silence periods in a recorded video.

3.1.2.2 Segmentation Process

Before the segmentation can begin, the silence features must be learned. To achieve this goal, the question of what counts as "silence" must be answered first. Silence here means the audio parts recorded while the model is not speaking. However, the volume of these parts is usually not zero due to environmental noise, so they cannot be detected by simply searching for zero-volume zones.

To decide a volume threshold for distinguishing silent parts from sound parts, the period in which the model shakes his/her head is utilized. Since the model is asked to keep silent during this period, the recorded volume originates solely from environmental noise, and the maximum volume appearing in this period can be taken as the threshold value. The duration of this period is known because the system operator controls it.

After the threshold value is determined, the silent parts can be found by searching for parts whose volumes stay below the threshold value. However, short pauses between syllables within a sentence may also be detected as silences. To solve this problem, the lengths of the audio parts must be taken into consideration; parts that are not long enough are discarded. The pauses between sentences are designed to be much longer than the natural pauses between syllables to avoid erroneous detections.

Finally, the silent audio parts can be found, and then the sound parts can be found and segmented. The entire process of sentence segmentation is described as follows, and a flowchart of this process is shown in Fig. 3.2.

Algorithm 1. Sentence segmentation by silence features.

Input: an audio Atranscript of the entire transcript, a predefined duration Dshake of head shaking, and a predefined duration Dpause of the pauses between sentences.

Output: several audio parts of sentences Asentence1, Asentence2, etc.

Steps:

Step 1: Find the maximum volume V appearing in the audio within the initial duration Dshake.

Step 2: Find a continuous audio part Asilence whose volume is always smaller than V and which lasts longer than Dpause.

Step 3: Repeat Step 2 until all silent parts are collected.

Step 4: Find a continuous audio part Asentence that is not occupied by any Asilence.

Step 5: Repeat Step 4 until all sound parts are collected.

Step 6: Break Atranscript into the audio parts of the sentences.
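The sketch below is one possible realization of Algorithm 1 in code, working on a mono waveform of the recorded audio. It is only an illustration under stated assumptions: the windowing scheme, the RMS volume measure, and the function names are choices made here, not details specified by the thesis.

```python
# A minimal sketch (not the thesis implementation) of Algorithm 1:
# sentence segmentation by silence features.
# Assumptions: `samples` is a 1-D numpy array holding the mono audio of the
# whole transcript, `sr` is its sampling rate; RMS over short windows is used
# as the "volume" measure. All names are hypothetical.
import numpy as np

def segment_sentences(samples, sr, d_shake, d_pause, win=0.02):
    """Return (start, end) times in seconds of the sound (sentence) parts."""
    hop = int(win * sr)
    n_win = len(samples) // hop
    # Per-window RMS volume.
    vol = np.array([
        np.sqrt(np.mean(samples[i * hop:(i + 1) * hop] ** 2.0))
        for i in range(n_win)
    ])

    # Step 1: the maximum volume during the initial head-shaking period
    # (the model is silent) is used as the noise threshold V.
    shake_wins = int(d_shake / win)
    threshold = vol[:shake_wins].max()

    # Steps 2-3: mark windows below the threshold, then keep only the
    # silent runs that last longer than D_pause.
    silent = vol <= threshold
    min_run = int(d_pause / win)
    is_long_silence = np.zeros(n_win, dtype=bool)
    i = 0
    while i < n_win:
        if silent[i]:
            j = i
            while j < n_win and silent[j]:
                j += 1
            if j - i >= min_run:
                is_long_silence[i:j] = True
            i = j
        else:
            i += 1

    # Steps 4-6: the sentence parts are the runs not occupied by any long
    # silence (the initial head-shaking period is silent, so it is excluded).
    sentences = []
    i = 0
    while i < n_win:
        if not is_long_silence[i]:
            j = i
            while j < n_win and not is_long_silence[j]:
                j += 1
            sentences.append((i * win, j * win))
            i = j
        else:
            i += 1
    return sentences
```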

Fig. 3.3 illustrates an example of the experimental results of the proposed segmentation method. The blue and green parts represent the odd and even sentences, respectively. It can be seen that the sound parts of the audio are detected correctly.


Fig. 3.2 A flowchart of the sentence segmentation process.

Fig. 3.3 An example of sentence segmentation results. The time of head shaking is 5 seconds, and the time of pausing between sentences is 1 second.

3.1.3 Review of Alignments of Mandarin Syllables

After the segmentation of sentences is done, the timing information of each syllable in a sentence can be learned by speech recognition or alignment techniques. Alignment techniques are themselves a kind of speech recognition technique; however, because they know in advance which syllables are spoken in the input speech, they produce results with higher accuracy. In this study, a speech alignment technique based on the Hidden Markov Model is used to learn the timing information of the syllables.

The Hidden Markov Model, abbreviated as HMM, is a statistical model for speech recognition and alignment. It is used to characterize the spectral properties of the frames of a speech pattern. In [6], Lin and Tsai adopted a sub-syllable model together with the HMM for the recognition of Mandarin syllables. After the sub-syllable models are constructed, the Viterbi search is used to segment the utterance. Finally, the timing information of every syllable in the input speech is produced.
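To illustrate the idea of alignment by a Viterbi-style search, the sketch below aligns a known sequence of syllables to a sequence of acoustic frames using dynamic programming. It is a deliberately simplified picture: each syllable is treated as a single state with uniform transitions, and the per-frame log-likelihoods are assumed to be given, whereas the sub-syllable HMMs of [6] are far more detailed. All names are hypothetical.

```python
# A minimal sketch (not the method of [6]): forced alignment of a known
# syllable sequence to acoustic frames with a Viterbi-style dynamic program.
# `log_lik[t, s]` is assumed to be the log-likelihood of frame t under the
# model of the s-th syllable in the sentence; obtaining it is out of scope.
import numpy as np

def force_align(log_lik):
    """Return, for each syllable s, the (first_frame, last_frame) assigned to it."""
    T, S = log_lik.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)      # 0 = stay in syllable, 1 = advance
    dp[0, 0] = log_lik[0, 0]                # the first frame belongs to syllable 0
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            advance = dp[t - 1, s - 1] if s > 0 else -np.inf
            if advance > stay:
                dp[t, s], back[t, s] = advance + log_lik[t, s], 1
            else:
                dp[t, s], back[t, s] = stay + log_lik[t, s], 0
    # Backtrack from the last frame, which must end in the last syllable.
    states = np.zeros(T, dtype=int)
    s = S - 1
    for t in range(T - 1, 0, -1):
        states[t] = s
        s -= back[t, s]
    states[0] = 0
    # Convert the frame-to-syllable assignment into per-syllable boundaries.
    bounds = []
    for s in range(S):
        frames = np.where(states == s)[0]
        bounds.append((int(frames[0]), int(frames[-1])))
    return bounds
```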

3.2 Learning of Base Image Sequences

As mentioned in Section 3.1, the silence period in which the model shakes his/her head during the video recording process is used to help segment the sentences. However, this period is designed to serve a second function as well: to help learn the base image sequences. In Section 3.2.1, the meaning and use of base image sequences are described. In Section 3.2.2, a process that utilizes this silence period to learn base image sequences is proposed.

3.2.1 Descriptions

Base image sequences are sequences of base images, where a base image is a facial image onto which mutable facial features may be pasted to form new images with different facial expressions. For instance, Fig. 3.4(a) shows an example of a base image. After a new mouth image is pasted onto the correct position, a new face is produced, as illustrated in Fig. 3.4(b). In the same way, pasting a series of new mouth images onto a sequence of base images produces an animation of a talking face.

Fig. 3.4 An example of base images. (a) A base image. (b) A base image with a new mouth pasted on.
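The following sketch illustrates the basic pasting operation described above: a mouth image is blended onto a base image at a known position, with a feathered mask so that the seam is less visible. The blending scheme and all names are assumptions for illustration; the thesis's own integration techniques are described in later chapters.

```python
# A minimal sketch (not the thesis implementation): paste a mouth image onto
# a base image at a given position using a soft (feathered) alpha mask.
# Assumptions: images are numpy uint8 BGR arrays; `mask` is a float array in
# [0, 1] with the same height and width as the mouth image.
import numpy as np

def paste_mouth(base, mouth, mask, top_left):
    """Return a copy of `base` with `mouth` blended in at `top_left` = (y, x)."""
    y, x = top_left
    h, w = mouth.shape[:2]
    out = base.copy()
    region = out[y:y + h, x:x + w].astype(np.float64)
    alpha = mask[..., None]                      # broadcast over color channels
    blended = alpha * mouth.astype(np.float64) + (1.0 - alpha) * region
    out[y:y + h, x:x + w] = np.clip(blended, 0, 255).astype(np.uint8)
    return out

# Example with dummy data: a 480x720 base image and a 60x100 mouth patch.
base = np.full((480, 720, 3), 200, dtype=np.uint8)
mouth = np.full((60, 100, 3), 90, dtype=np.uint8)
mask = np.ones((60, 100))
mask[:5, :] = mask[-5:, :] = 0.5                 # crude feathering at the edges
result = paste_mouth(base, mouth, mask, (300, 310))
```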

As mentioned above, base images provide places for variable facial features to be pasted on. Such variable features normally include eyebrows, eyes, mouths, and so on. However, mouths are the only variable features adopted in this study. Eyebrows and eyes are not pasted onto the base images; instead, the eyebrows and eyes already on the base images are retained, which yields animations with more natural eye-blinking actions.

The motion of the head is another kind of feature controlled by the base images. By inserting several images of a shaking head into the base image sequence, the produced animation can show a speaking person shaking his/her head. In the same way, other kinds of head movements, such as nodding, can be incorporated.

The base images also control other elements of the generated animation; for example, the background, the body, and the hands are all determined by the base images.

3.2.2 Learning Process

To produce base image sequences, the initial silence period in the video
