
Chapter 2 System Overview

2.4 Animation Generation Process

2.4.1 Properties of Animations

To create virtual faces that are capable of improving human-computer interfaces, the created animations should possess several properties. Firstly, they should have realistic appearances; interacting with an unrealistic face feels strange and unnatural. Secondly, lip movements should be smooth. Since people are accustomed to watching each other's lip movements while talking, unnatural lip movements are noticed easily. Thirdly, fluent speaking ability is required. Fourthly, speech and lip movements should be synchronized, because humans are sensitive to asynchrony between speech and lip movements.

In this study, the animations created by the proposed system possess all of these properties. Firstly, the animations are generated from 2D image frames; with good techniques for the integration of facial parts, they look realistic and natural. Secondly, several methods are proposed to smooth the lip movements. Thirdly, the original sound tracks of real people are adopted in the final animations, which avoids the problem of unnatural voices. Fourthly, a method is proposed to synchronize speech and lip movements.

2.4.2 Animation Generation Process

The animation generation process requires two inputs: a transcript and its corresponding speech data. First, a person is asked to read the transcript, and the speech is recorded. The syllable alignment process then extracts the timing information of the syllables in the speech. With this timing information and the learned feature data, image frames synchronized with the speech can be generated.

Finally, animations are generated by the composition of image frames and speech data.
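To make the synchronization idea concrete, the following is a minimal Python sketch, assuming the syllable timings are available as (syllable, start, end) tuples and that a sequence of mouth images is stored for every syllable; the 25 fps frame rate, the function name, and the dummy data are illustrative assumptions, not values fixed by the proposed system.

    FPS = 25  # assumed output frame rate

    def frames_for_syllables(syllable_timings, viseme_frames):
        """syllable_timings: list of (syllable, start_sec, end_sec) tuples;
        viseme_frames: dict mapping each syllable to its stored mouth images."""
        frames = []
        for syllable, start, end in syllable_timings:
            n_frames = max(1, round((end - start) * FPS))
            source = viseme_frames[syllable]
            # Resample the stored mouth-image sequence over the syllable's
            # duration so that the lips stay synchronized with the speech.
            for i in range(n_frames):
                frames.append(source[i * len(source) // n_frames])
        return frames

    # Example with two syllables and dummy frame labels standing in for images.
    timings = [("hao", 0.00, 0.32), ("peng", 0.32, 0.60)]
    visemes = {"hao": ["hao_0", "hao_1", "hao_2"], "peng": ["peng_0", "peng_1"]}
    print(frames_for_syllables(timings, visemes))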

Fig. 2.6 illustrates a flowchart of the animation generation process.


Fig. 2.6 A flowchart of the animation generation process.

Chapter 3

Automatic Learning of Audio Features and Base Image Sequences

3.1 Learning of Audio Features

In Section 2.3 of the last chapter, the four types of features that must be learned in the feature learning process were described. The following sections concentrate on the learning of audio features. In Section 3.1.1, the audio features used in this study are described in detail. In Section 3.1.2, a method for the segmentation of sentences is proposed. And in Section 3.1.3, the process of syllable alignment is reviewed.

3.1.1 Descriptions of Audio Features

In the video recording process, a video of a model's face is recorded while he or she reads a pre-designed transcript. The timing information, namely the duration, of every syllable in the speech must be learned. Without this information, syllable labels cannot be assigned to image frames, so it would be impossible to know which syllable an image frame belongs to.

Since the pre-designed transcript is composed of seventeen sentences designed in this study, the speech of every sentence is learned first, before the learning of the syllables. It is possible to learn the timing information of the syllables directly from the speech of the entire transcript without segmenting the sentences. However, the work takes much more time as the length of the input audio increases. By segmenting the sentences in advance, shorter audio parts are used in the learning process, which accelerates the processing.

Audio features mentioned above are listed in Table 3.1.

Table 3.1 Descriptions of audio features.

Feature | Description | Example
Speech of Transcript | A speech that contains the audio data of the entire transcript, including seventeen sentences. |
Speech of Sentence | A speech that contains the audio data of a single sentence, including several syllables. | 好朋友自遠方來。
Speech of Syllable | A speech that contains the audio data of a single syllable. | ㄏㄠ、ㄆㄥ、ㄧㄡ、ㄗ、ㄩㄢ、ㄈㄤ、ㄌㄞ

3.1.2 Segmentation of Sentences by Silence Features

In the preceding section, the reason for sentence segmentation was explained. In the following sections, a method for sentence segmentation is proposed. A new kind of audio feature called “silence feature” used in the proposed method is introduced first.

3.1.2.1 Idea of Segmentation

In order to segment speeches of sentences automatically, the video recording process is designed to let the model keep silent for predefined periods of time in two situations as defined in the following:

(1) Initial silence: A period of time when the model keeps silent while shaking his/her head, such as the red part in Fig. 3.1.

(2) Intermediate silence: A period of time when the model keeps silent in the pauses between sentences, such as the blue parts in Fig. 3.1.

If the above-mentioned silence features can be learned, periods of silence can be detected, which means that periods of sentences can be detected, too. After that, the segmentation of speeches of sentences will become an easy job. In the following section, a method is proposed to detect the silence features, along with a sentence segmentation algorithm.

Fig. 3.1 A diagram that shows the audios, the corresponding actions taken, and the silence periods in a recorded video.

3.1.2.2 Segmentation Process

Before the segmentation can begin, the silence features must be learned first. To achieve this goal, the problem of the determination of “silence” must be solved.

Silence here means the audio parts recorded while the model does not speak. However, the volume of these parts is usually not zero due to the noise in the environment, so they cannot be detected by simply searching for zero-volume zones.

To decide a volume threshold for distinguishing the silent parts from the sound ones, the period when the model is shaking his/her head is utilized. Since the model is asked to keep silent in this period, the recorded volume originates only from the noise in the environment. The maximum volume that appears in this period can be taken as the threshold value. The duration of this period is known easily because the system operator controls it.

After the threshold value is determined, the silent parts can be found by searching for the ones whose volumes are always smaller than the threshold value.

However, short pauses between syllables within a sentence may also be detected as silences. To solve this problem, the lengths of the audio parts should be taken into consideration; that is, the parts that are not long enough should be discarded. The duration of the pauses between sentences is designed to be much longer than that of the natural pauses between syllables to avoid erroneous detections.

Finally, the silent audio parts can be found. Then, the sound parts can be found and segmented. The entire process of sentence segmentation is described as follows, and a flowchart of this process is shown in Fig. 3.2.

Algorithm 1. Sentence segmentation by silence features.

Input: an audio Atranscript of the entire transcript, a predefined duration Dshake for head shaking, and a predefined duration Dpause for pausing between sentences.

Output: several audio parts of sentences Asentence1, Asentence2, etc.

Steps:

Step 1: Find the maximum volume V appearing in the audio parts within Dshake.

Step 2: Find a continuous audio part Asilence whose volume is always smaller than V and lasts longer than Dpause.

Step 3: Repeat Step 2 until all silent parts are collected.

Step 4: Find a continuous audio part Asentence that is not occupied by any Asilence.

Step 5: Repeat Step 4 until all sound parts are collected.

Step 6: Break Atranscript into audio parts of sentences.
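The following is a minimal numpy sketch of Algorithm 1, assuming the audio is available as a one-dimensional array of sample magnitudes ("volume") at a known sampling rate; the sample-wise comparison and the function name are illustrative simplifications of the process described above.

    import numpy as np

    def segment_sentences(volume, rate, d_shake, d_pause):
        """Return (start, end) sample indices of the sound (sentence) parts."""
        # Step 1: the maximum volume inside the head-shaking period is taken
        # as the silence threshold V.
        v = volume[: int(d_shake * rate)].max()

        # Steps 2-3: mark samples not louder than V, then keep only the
        # silent runs that last longer than D_pause.
        silent = volume <= v
        min_len = int(d_pause * rate)
        boundaries = np.flatnonzero(np.diff(silent.astype(int))) + 1
        runs = np.split(np.arange(len(volume)), boundaries)
        silences = [r for r in runs if silent[r[0]] and len(r) >= min_len]

        # Steps 4-6: the sentence parts are the gaps between the silences.
        sentences, prev_end = [], 0
        for r in silences:
            if r[0] > prev_end:
                sentences.append((prev_end, r[0]))
            prev_end = r[-1] + 1
        if prev_end < len(volume):
            sentences.append((prev_end, len(volume)))
        return sentences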

Fig. 3.3 illustrates an example of the experimental results of the proposed segmentation method. The blue and green parts represent odd and even sentences, respectively. It is shown that the sound parts of the audio are learned correctly.

Fig. 3.2 Flowchart of the sentence segmentation process.

Fig. 3.3 An example of sentence segmentation results. The time of head shaking is 5 seconds, and the time of pausing between sentences is 1 second.

3.1.3 Review of Alignments of Mandarin Syllables

After the segmentation of sentences is done, the timing information of each syllable in a sentence can be learned by speech recognition or alignment techniques.

Alignment techniques are themselves a kind of speech recognition technique; however, they need to know in advance the syllables spoken in the input speech, and therefore they produce results with higher accuracy. In this study, a speech alignment technique based on the Hidden Markov Model is utilized to learn the timing information of the syllables.

The Hidden Markov Model (HMM) is a statistical model for speech recognition and alignment. It is used to characterize the spectral properties of the frames of a speech pattern. In [6], Lin and Tsai adopted a sub-syllable model together with the HMM for the recognition of Mandarin syllables.

After the sub-syllable model is constructed, the Viterbi search is used to segment the utterance. Finally, the timing information of every syllable in the input speech is produced.
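As a rough illustration of the Viterbi search mentioned above, the sketch below decodes the most likely state sequence of a generic discrete HMM in the log domain; it only shows the principle and is not the sub-syllable model of [6], which operates on continuous acoustic observations.

    import numpy as np

    def viterbi(log_init, log_trans, log_emit, observations):
        """log_init: (S,) initial log-probabilities; log_trans: (S, S)
        transition log-probabilities; log_emit: (S, O) emission
        log-probabilities; observations: a sequence of observation indices."""
        T, S = len(observations), log_init.shape[0]
        score = np.empty((T, S))
        back = np.zeros((T, S), dtype=int)
        score[0] = log_init + log_emit[:, observations[0]]
        for t in range(1, T):
            cand = score[t - 1][:, None] + log_trans   # cand[i, j]: state i -> j
            back[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0) + log_emit[:, observations[t]]
        # Trace back the most likely state sequence.
        state = int(score[-1].argmax())
        path = [state]
        for t in range(T - 1, 0, -1):
            state = int(back[t, state])
            path.append(state)
        return path[::-1]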

3.2 Learning of Base Image Sequences

As mentioned in Section 3.1, the silence period during which the model shakes his/her head in the video recording process is used to help segment the sentences. However, this period is designed to serve another function as well, namely to help learn base image sequences. In Section 3.2.1, the meaning and use of base image sequences are described. In Section 3.2.2, a process that utilizes the silence period to learn base image sequences is proposed.

3.2.1 Descriptions

Base image sequences are sequences of base images, where a base image is a facial image onto which some mutable facial features may be pasted to form new images with different facial expressions. For instance, Fig. 3.4(a) shows an example of a base image. After pasting a new mouth image onto the correct position, a new face is produced, as illustrated in Fig. 3.4(b). In the same way, after pasting several new mouth images onto a sequence of base images, an animation of a talking face is produced.

Fig. 3.4 Example of base images. (a) A base image. (b) A base image with a new mouth pasted on.
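The pasting operation itself is simple; the following is a minimal numpy sketch, where (x, y) is assumed to be the top-left corner of the mouth region obtained from the facial-feature learning described in Chapter 4, and the image sizes are only examples.

    import numpy as np

    def paste_mouth(base_image, mouth_image, x, y):
        """Return a copy of base_image with mouth_image pasted at (x, y)."""
        result = base_image.copy()
        h, w = mouth_image.shape[:2]
        result[y:y + h, x:x + w] = mouth_image
        return result

    # Example with dummy images: a 480x640 base frame and a 60x120 mouth patch.
    base = np.zeros((480, 640, 3), dtype=np.uint8)
    mouth = np.full((60, 120, 3), 255, dtype=np.uint8)
    frame = paste_mouth(base, mouth, x=260, y=300)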

As mentioned above, base images provide places for variable facial features to be pasted on. These variable features normally include the eyebrows, the eyes, the mouth, etc. However, the mouth is the only feature adopted as a variable one in this study. The eyebrows and eyes are not pasted onto the base images; instead, the eyebrows and eyes already on the base images are retained to produce animations with more natural eye-blinking actions.

The motion of the head is another kind of feature controlled by the base images. By inserting several images of a shaking head into the base image sequence, the produced animation can show a speaking person with his/her head shaking. In the same way, other kinds of head movements, such as nodding, can be integrated. The base images also control other parts of the generated animation; for example, the background, the body, and the hands all come from the base images.

3.2.2 Learning Process

To produce base image sequences, the initial silence period in the video recording process is utilized. The model is asked to shake his/her head during this period to simulate natural shaking of heads while speaking, and the image frames recorded during this period are used as base images. Certainly, all image frames of the recorded video can be used as base images. However, there are some drawbacks.

Firstly, since the actions of the eyes and eyebrows originate from the base images, namely, all the image frames of the recorded video, the model must keep his/her eyes "natural" during the entire recording process, which is very tiring. Secondly, since the model is asked to pause for a while between sentences, the generated base image sequences would also exhibit this behavior, which is unacceptable. To avoid these drawbacks, only the image frames recorded during the head-shaking period are used as base images in this study; this period is short and contains no pauses.

To generate a sequence of base images, an initial frame is selected first, and then a traverse direction, either forward or backward, is selected. Starting from the initial frame, the frames met along the traverse direction are added to the sequence. In order to create animations with varied head motions, the initial frame and the traverse direction are randomly selected for every animation session. Besides, since the desired animation may require more image frames than the total number of base images, the images must be used repeatedly. One way to solve this problem is to reverse the traverse direction when reaching the first or the last base image. Fig. 3.5 illustrates this situation.

Fig. 3.5 A diagram that shows the generation process of base image sequences.

The entire process of generating base image sequences is described as follows, and a flowchart of this process is shown in Fig. 3.6.

Algorithm 2. Learning process of base image sequences.

Input: a sequence of image frames I = {I1, I2, …, IM} of the recorded video in the head-shaking period, and the number N of desired base images.

Output: a sequence of base images B = {B1, B2, …, BN}.

Steps:

Step 1: Randomly select an initial frame Iinitial in I.

Step 2: Randomly select an initial direction, either forward or backward.

Step 3: Add the current frame to B.

Step 4: Stop learning if the number of frames in B equals N.

Step 5: Reverse the direction if the current frame is I1 or IM.

Step 6: Advance to the next frame along the selected direction.

Step 7: Repeat Steps 3 through 6.
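A minimal Python sketch of Algorithm 2 is given below, assuming the head-shaking frames are provided as a list; the direction is reversed when the traversal would pass I1 or IM, which corresponds to Step 5, and the dummy frame labels in the example stand in for actual images.

    import random

    def learn_base_sequence(frames, n_desired):
        """frames: the image frames I1..IM recorded in the head-shaking period;
        n_desired: the number N of base images to produce."""
        index = random.randrange(len(frames))   # Step 1: random initial frame
        step = random.choice([1, -1])           # Step 2: random direction

        base_sequence = []
        while len(base_sequence) < n_desired:   # Steps 3, 4, and 7
            base_sequence.append(frames[index])
            if not 0 <= index + step < len(frames):
                step = -step                    # Step 5: bounce at I1 or IM
            index = max(0, min(index + step, len(frames) - 1))  # Step 6
        return base_sequence

    # Example: five frames traversed back and forth to yield twelve base images.
    print(learn_base_sequence([f"I{i}" for i in range(1, 6)], n_desired=12))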

3.3 Experimental Results

In this section, some experimental results of the proposed methods described in this chapter are shown. Firstly, Fig. 3.7 shows the entire audio of a transcript, and Fig. 3.8 shows the sentence segmentation result. The 17 speaking parts of the audio, which are represented in blue and green colors, are detected successfully.

Fig. 3.6 Flowchart of the learning process of base image sequences.

Fig. 3.7 An example of entire audio data of a transcript. The duration of head shaking is 5 seconds, and the duration of pausing between sentences is 1 second.

Fig. 3.8 The audio data in Fig. 3.7 is segmented into 17 parts.

Secondly, Fig. 3.9 shows an audio containing a Mandarin sentence, and Fig. 3.10 shows the result of syllable alignment. The durations of the syllables are shown in blue and green colors. In Figs. 3.11 and 3.12, another Mandarin sentence and its corresponding syllable alignment result are shown.

Fig. 3.9 An audio that contains a Mandarin sentence “好朋友自遠方來”.

Fig. 3.10 The result of syllable alignment of the audio in Fig. 3.9.

Fig. 3.11 An audio that contains a Mandarin sentence “熟能生巧”.

Fig. 3.12 The result of syllable alignment of the audio in Fig. 3.11.

Thirdly, Fig. 3.13 shows a base image sequence produced with the proposed method.

Fig. 3.13 A base image sequence produced with the proposed method.

3.4 Summary and Discussions

In this chapter, an automatic method for sentence segmentation is proposed. The method works well in quiet environments, and it is also workable in environments with constant noise, such as that of fans and cooling systems. The segmentation helps accelerate the subsequent work of syllable alignment. In addition, a method for generating base image sequences by utilizing the period of head shaking is proposed. The sequences generated are different for every session, which helps provide varied background images for the animations.

Chapter 4

Automatic Learning of Facial Features

4.1 Introduction

To create an animation of a speaking person, the syllables spoken are collected first, and then the visemes corresponding to the syllables must be "pasted" onto the base image sequence. The visemes, namely the mouth images, should be pasted onto the correct positions of the faces; otherwise the generated animation will look strange. As shown in Fig. 4.1, pasting onto incorrect positions leads to unacceptable results.

Fig. 4.1 Example of base images. (a) A base image. (b) The base image with a new mouth pasted onto the correct position. (c) The base image with a new mouth pasted onto an incorrect position.

In order to decide the correct positions for the mouth images to be pasted on, three types of methods have been tried in this study. The first is to measure the positions manually. The positions obtained this way may be very precise; however, it is not practical to perform this work on many frames. The second is to place some marks on the face, so that the positions can be detected easily and automatically. However, this method has the disadvantage of requiring extra marks on the face. The third is to measure the positions by applying face recognition techniques to every frame. This method is fully automatic; however, the recognition results are often not stable enough due to slight variations in lighting. Slight movements of the muscles under the skin may also affect the recognition results significantly, even though human eyes may not notice them.

In this study, a method that integrates the second and the third methods mentioned above is proposed. A face recognition technique using a knowledge-based approach is used to learn the positions of the facial features in the first frame; the technique is reviewed in Section 4.2. The spatial relations between these features, which remain invariant for a given face, are recorded. Then, one facial feature is used as a kind of mark; this "natural" mark, called the base region in this study, can be detected by image matching techniques. Finally, the positions of the other facial features can be calculated according to the spatial relations. The process is illustrated in Fig. 4.2.

One advantage of this method is that the matching results are more stable, since the mark remains unchanged in every frame. Another advantage is that the image matching techniques can even be applied to rotated faces, which is discussed in Section 4.3.2.

When selecting a facial feature to serve as the base region, its invariance is important. Among the facial features listed in Fig. 4.3, the nose is the only one that keeps an invariant shape while the face is speaking: the eyebrows may move slightly due to expressions, the eyes may blink from time to time, and the shapes of the mouth and the jaw obviously change on a speaking face. Therefore, the nose is selected as the base region in this study.
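As one concrete way to realize this idea, the sketch below uses OpenCV template matching, one possible image matching technique rather than necessarily the one adopted in this study, to locate the nose (base) region in a frame and then derive the mouth position from the spatial offset measured on the first frame; the file names and offset values are illustrative assumptions.

    import cv2

    def locate_mouth(frame_gray, nose_template, mouth_offset):
        """Find the base region by template matching, then apply the learned
        offset (dx, dy) between the nose and the mouth."""
        result = cv2.matchTemplate(frame_gray, nose_template, cv2.TM_CCOEFF_NORMED)
        _, _, _, nose_top_left = cv2.minMaxLoc(result)  # best-match location (x, y)
        dx, dy = mouth_offset
        return (nose_top_left[0] + dx, nose_top_left[1] + dy)

    # Usage sketch: the offset is measured once on the first frame, where the
    # features are located by the knowledge-based technique of Section 4.2.
    # frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
    # nose = cv2.imread("nose_region.png", cv2.IMREAD_GRAYSCALE)
    # mouth_xy = locate_mouth(frame, nose, mouth_offset=(-20, 55))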

Fig. 4.2 Flowchart of the learning process of facial features.

Fig. 4.3 Facial features. (a) The eyebrows. (b) The eyes. (c) The nose. (d) The mouth. (e) The jaw.

4.2 Review of Knowledge-Based Face Recognition Techniques

Knowledge-based face recognition techniques use common knowledge about facial features to detect their positions. One example of such knowledge is that the two eyes of a face have similar shapes. Another is that the eyebrows have similar shapes and are always above the eyes.

In this study, the relations and shapes of the facial features are used as knowledge to learn their positions. First, the skin part of a facial image is found by color thresholding. Then, candidate facial features are filtered according to the feature properties and relations. Finally, the edges of the image are used to locate the positions of the facial features more precisely.
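For illustration, the sketch below performs the color-thresholding step using a commonly cited YCrCb skin-color range; the actual thresholds and the subsequent filtering by feature properties, relations, and edges used in this study are not reproduced here.

    import cv2
    import numpy as np

    def skin_mask(image_bgr):
        """Return a binary mask of the skin-colored pixels in a BGR image."""
        ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
        # Assumed skin range in the Cr/Cb channels; tune for the actual lighting.
        lower = np.array([0, 133, 77], dtype=np.uint8)
        upper = np.array([255, 173, 127], dtype=np.uint8)
        return cv2.inRange(ycrcb, lower, upper)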

4.3 Learning of Base Regions

In this section, the proposed learning process of base regions is described in detail. After the position of the base region in the first frame is determined using the technique described in Section 4.2, the process is performed on the other frames to learn the positions of their base regions. The positions of the facial features can then be determined easily.

4.3.1 Review of Image Matching Techniques

In Section 4.1, it is mentioned that the base region positions of the frames other than the first one can be determined using image matching techniques. These techniques are used to find the position of a pattern image inside a base image. Fig.

