
Chapter 3 Automatic Learning of Audio Features and Base Image Sequences

3.4 Summary and Discussions

In this chapter, an automatic method for sentence segmentation is proposed. The method works well in silent environments, and it is also workable in environments with constant noise, such as the noise of fans and cooling systems. The segmentation helps accelerate the subsequent work of syllable alignment. Besides, a method for generating base image sequences by utilizing the period of head shaking is proposed. The sequences generated are different for every session, which helps prepare varied background images for the animations.

Chapter 4

Automatic Learning of Facial Features

4.1 Introduction

To create an animation of a speaking person, the spoken syllables are collected first, and then the visemes corresponding to the syllables must be “pasted” onto the base image sequence. The visemes, namely, the mouth images, should be pasted onto the correct positions on the face; otherwise the generated animation will look strange. As shown in Fig. 4.1, pasting onto incorrect positions leads to unacceptable results.

Fig. 4.1 Example of base images. (a) A base image. (b) The base image with a new mouth pasted onto the correct position. (c) The base image with a new mouth pasted onto an incorrect position.

In order to decide the correct positions where the mouth images should be pasted, three types of methods have been tried in this study. The first is to measure the positions manually. The positions obtained in this way can be very precise, but it is impractical to perform this work on many frames. The second is to plaster the face with some marks, so that the positions can be detected easily and automatically.

However, this method bears the disadvantage of plastering extra marks on the face.

The third method is to measure the positions by applying face recognition techniques to every frame. This method is fully automatic; however, the recognition results are often not stable enough due to slight variations in lighting. Slight movements of the muscles under the skin may also affect the recognition results significantly, even though human eyes may not notice them.

In this study, a method that integrates the second and the third methods mentioned above is proposed. A face recognition technique using a knowledge-based approach is used to learn the positions of the facial features in the first frame. The technique is reviewed in Section 4.2. The spatial relations between these features, which remain invariant for a given face, are recorded. Then, one kind of facial feature is used as a mark, and this “natural” mark, which is called the base region in this study, can be detected by image-matching techniques. Finally, the positions of the other facial features can be calculated from the base region position according to the spatial relations. The process is illustrated in Fig. 4.2.

One advantage of this method is that the matching results are more stable because the mark remains unchanged in every frame. Another advantage is that the image matching techniques can even be applied to rotated faces, as discussed in Section 4.3.3.

To select a proper facial feature to be used as the base region, invariance is the key consideration. Among the facial features listed in Fig. 4.3, the nose is the only one whose shape remains invariant while the face is speaking. The eyebrows may move slightly with expressions, and the eyes may blink from time to time. The shapes of the mouth and the jaw change markedly on a speaking face. Therefore, the nose is selected as the base region in this study.


Fig. 4.2 Flowchart of the learning process of facial features.

Fig. 4.3 Facial features. (a) The eyebrows. (b) The eyes. (c) The nose. (d) The mouth. (e) The jaw.

4.2 Review of Knowledge-Based Face Recognition Techniques

Knowledge-based face recognition techniques use common knowledge about facial features to detect their positions. An example of such knowledge is that the two eyes on a face have similar shapes. Another example is that the eyebrows have similar shapes and always lie above the eyes.

In this study, the relations and shapes of facial features are used as the knowledge to learn their positions. First, the skin part of a facial image is found by color thresholding. Then, candidate facial features are filtered according to their properties and relations. Finally, the edges of the image are used to locate the positions of the facial features more precisely.
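To make the first step concrete, the following Python sketch performs a simple skin-color thresholding on an RGB image. It is a minimal sketch only: the rule and the threshold values are common illustrative choices, not necessarily those used in this study.

```python
import numpy as np

def skin_mask(rgb, r_min=95, g_min=40, b_min=20):
    """Return a boolean mask of candidate skin pixels in an RGB image.

    A simple rule-based threshold: skin pixels tend to have a dominant
    red channel and moderate green/blue values.  The threshold values
    here are illustrative only.
    """
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    return ((r > r_min) & (g > g_min) & (b > b_min)
            & (r > g) & (r > b) & (r - np.minimum(g, b) > 15))
```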

4.3 Learning of Base Regions

In this section, the proposed learning process of base regions is described in detail. After the position of the base region in the first frame is determined using the technique described in Section 4.2, the process is performed on the other frames to learn the positions of their base regions. The positions of the facial features can then be determined easily.

4.3.1 Review of Image Matching Techniques

In Section 4.1, it is mentioned that the base region positions of the frames other than the first one can be determined using image matching techniques. These techniques are used to find the position of a pattern image inside a base image. Fig. 4.4 shows the block diagram of common image matching techniques. By shifting the position of the pattern image (or, conversely, the base image), a measure is calculated for every candidate position according to a pre-designed formula. Finally, the position corresponding to the best measure is adopted.


Fig. 4.4 Block diagram of common image matching techniques.

It is obvious that the formula affects the results severely. In an environment with controlled lighting, a formula that calculates the “Euclidean distance” between two images is sufficient. For example, Equation (4.1) below is used in this study to calculate the Euclidean distance between the colors of two images:

D(u, v) = Σx Σy {[RP(x, y) − RB(x + u, y + v)]² + [GP(x, y) − GB(x + u, y + v)]² + [BP(x, y) − BB(x + u, y + v)]²},  (4.1)

where RP, GP, and BP denote the red, green, and blue values of the pattern image, RB, GB, and BB denote those of the base image, and (u, v) is the candidate position of the pattern image inside the base image.
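The matching procedure itself can be sketched in Python as follows (assuming NumPy arrays; the function name match_pattern is illustrative). It exhaustively evaluates the measure of Equation (4.1) at every candidate position and keeps the smallest one.

```python
import numpy as np

def match_pattern(base, pattern):
    """Exhaustive image matching with the squared color-distance measure.

    base and pattern are H x W x 3 uint8 arrays.  For every candidate
    offset (x, y), the sum of squared RGB differences between the
    pattern and the corresponding window of the base image is computed,
    and the offset with the smallest measure is returned.
    """
    bh, bw = base.shape[:2]
    ph, pw = pattern.shape[:2]
    pat = pattern.astype(float)
    best_pos, best_measure = (0, 0), float("inf")
    for y in range(bh - ph + 1):
        for x in range(bw - pw + 1):
            window = base[y:y + ph, x:x + pw].astype(float)
            measure = np.sum((window - pat) ** 2)
            if measure < best_measure:
                best_measure, best_pos = measure, (x, y)
    return best_pos, best_measure
```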

An example of the results of image matching using Equation (4.1) is illustrated in Fig. 4.5. Fig. 4.5(a) shows the first frame of a recorded video, and the blue block on it is the base region detected using the techniques described in Section 4.2. Fig. 4.5(b) shows the 2617th frame of the video, and its base region position is calculated using the image matching technique mentioned above.

Fig. 4.5 Example of image matching results. (a) The first frame of a video. The base region position is (346, 280). (b) The 2617th frame of the video. The base region position is (353, 283).

4.3.2 Learning by Image Matching with Sub-Pixel Precision

The image matching technique used in Section 4.3.1 is suitable for finding the base region positions of a face, even when the face is shaking. Table 4.1 shows an example of the base region positions of a sequence of images. It shows that the minimum unit of a base region position is one pixel. However, a face normally does not shift its position abruptly by a whole pixel between successive frames; it moves smoothly. Therefore, finding the positions with sub-pixel precision is useful for producing high-quality animations. Examples of positions with sub-pixel precision are (346.5, 280.5) and (353.3, 283.6).

To find a position with sub-pixel precision, the pattern image needs to be shifted by a distance shorter than a pixel, and the image matching technique is then performed on the shifted pattern image. To accomplish this shifting, the continuous nature of facial images is utilized.

In [8], Gonzalez and Woods illustrated the process of acquiring digital images from sensors. The image acquired from the sensors is continuous with respect to the x- and y-coordinates, and also in amplitude. The coordinate values and the amplitude values are sampled and quantized into digital form, respectively. The situation is shown in Fig. 4.6. The pink arrows in Fig. 4.6 indicate positions between pixels. If the amplitude values at these positions can be known, the image matching technique can be performed on these values, just as if the image were shifted within a pixel.

Fig. 4.6 A diagram of converting a continuous image into a digitized form.

Since face images are continuous, it is reasonable to assume that the amplitude values, namely, the color values, between two adjacent pixels approximate the values of these two pixels. In this study, the technique of bilinear interpolation is used to generate these values. Fig. 4.7 illustrates this technique. The color value of P’ is determined in proportion to the color values of the nearest four pixels P1 through P4 using the following equations:

A1 = |(x’ − x1)(y’ − y1)|, A2 = |(x’ − x2)(y’ − y2)|;
A3 = |(x’ − x3)(y’ − y3)|, A4 = |(x’ − x4)(y’ − y4)|;  (4.2)
P’ = (A1P4 + A2P3 + A3P2 + A4P1) / (A1 + A2 + A3 + A4).  (4.3)

Fig. 4.7 A diagram of the adopted bilinear interpolation technique.
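A minimal sketch of the interpolation is given below, assuming an H × W × 3 NumPy array; the helper name bilinear_sample is illustrative. Each of the four neighboring pixels is weighted by the area of the rectangle opposite to it, in the spirit of Equations (4.2) and (4.3).

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Estimate the color value at a sub-pixel position (x, y).

    image is an H x W x 3 array; (x, y) must lie strictly inside the
    image so that all four neighbors exist.
    """
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0
    p1 = image[y0, x0].astype(float)   # upper-left neighbor
    p2 = image[y0, x1].astype(float)   # upper-right neighbor
    p3 = image[y1, x0].astype(float)   # lower-left neighbor
    p4 = image[y1, x1].astype(float)   # lower-right neighbor
    return ((1 - dx) * (1 - dy) * p1 + dx * (1 - dy) * p2
            + (1 - dx) * dy * p3 + dx * dy * p4)
```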

A method that performs image matching with sub-pixel precision using the above-mentioned ideas and techniques is proposed and described as follows. First, the pattern image, namely, the image of the base region, and the base image are enlarged by a predefined ratio using the bilinear interpolation technique; the color values of pixels that have no corresponding pixels in the original image are filled with interpolated values. Second, the image matching technique described in Section 4.3.1 is applied to the enlarged pattern and base images to find the best position of the base region. Finally, the position is scaled back down according to the predefined ratio.

The algorithm of the image matching with sub-pixel precision is described as follows.

Algorithm 3. Image matching with sub-pixel precision.

Input: a pattern image Ipattern, a base image Ibase, and a predefined ratio r.

Output: the position P(x, y) of Ipattern in Ibase.
Steps:

Step 1: Enlarge Ipattern by a factor of r to get a new image IpatternL.
Step 2: Enlarge Ibase by a factor of r to get a new image IbaseL.

Step 3: Use the image matching technique to find the position P’(x’, y’) of IpatternL in IbaseL.

Step 4: Divide x’ by r to get x.

Step 5: Divide y’ by r to get y.
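A Python sketch of Algorithm 3 is given below, reusing the match_pattern helper sketched in Section 4.3.1 and PIL's bilinear resize for the enlargement; the helper names and the use of an integer ratio are assumptions for illustration.

```python
import numpy as np
from PIL import Image

def enlarge(image, r):
    """Enlarge an H x W x 3 uint8 array by an integer factor r with bilinear interpolation."""
    h, w = image.shape[:2]
    return np.asarray(Image.fromarray(image).resize((w * r, h * r), Image.BILINEAR))

def match_subpixel(base, pattern, r=2):
    """Image matching with sub-pixel precision.

    Both images are enlarged r times, matched at integer precision on
    the enlarged images, and the resulting position is divided by r.
    """
    (x_large, y_large), _ = match_pattern(enlarge(base, r), enlarge(pattern, r))
    return x_large / r, y_large / r
```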

An example of results of the proposed method is shown in Fig. 4.8. The ratio value is 2 in this example.

4.3.3 Handling Rotated Faces

In the video recording process, the model is asked to shake his/her head for a period of time, so that the generated base image sequences exhibit a speaking face with natural shaking. Besides, the face of the model may not always be upright due to his/her speaking habits. However, the image matching technique cannot be applied effectively to such faces because the base regions are “rotated.” In Section 4.3.3.1, the problems caused by rotated faces are described. In Section 4.3.3.2, a modified image matching method suitable for rotated faces is proposed.

Fig. 4.8 Example of results of image matching with sub-pixel precision. (a) The 2597th frame of a video. The base region position is (354.5, 284.0). (b) The 2598th frame of the video. The base region position is (354.5, 284.5).

4.3.3.1 Problems of Rotated Faces

The image matching technique described in Section 4.3.1 is effective in finding the position of a pattern image in a base image. However, this is true only under the assumption that the pattern part of the base image is very similar to the pattern image. For rotated faces, the pattern parts, namely, the base regions in this study, are rather different from the base region of the first frame. Therefore, the results of image matching are often not precise enough.

Another problem is that the rotated angle of a rotated face is not known even if the position of the base region is detected correctly. Fig. 4.9 shows the problem. As described in Section 4.1, the positions of facial features are calculated according to the base region position and the spatial relations. For a straight face like the one in Fig. 4.9(a), the positions of the facial features can be calculated correctly. However, for a rotated face like the one in Fig. 4.9(c), the positions of the facial features cannot be calculated correctly with the help of the spatial relations alone, even if the base region position is right. To determine the positions of facial features correctly on a rotated face, the rotated angle must be found.

Fig. 4.9 A diagram that shows the problem of rotated faces. (a) A straight face with its mouth position determined correctly by the spatial relation. (b) A rotated face. (c) A rotated face with its mouth position determined incorrectly by the spatial relation.

4.3.3.2 Image Matching on Rotated Faces

To find the rotated angle of a rotated face, some extra work is added to the original image matching method. As shown in Fig. 4.10, the base region image is rotated first to generate several rotated versions. All of these rotated images are used as pattern images. Then, the image matching technique is applied. Finally, the best position and rotated angle of the base region are learned. The algorithm of this process is described as follows.

Algorithm 4. Image matching on rotated faces.

Input: a pattern image Ipattern, and a base image Ibase.
Output: the position and rotated angle of Ipattern in Ibase.
Steps:

Step 1: Rotate Ipattern with an incremental series of degrees and get a set of new images IpatternR = {IpatternR1, IpatternR2, …, IpatternRN}.

Step 2: Select an image IpatternR’ in IpatternR as the pattern image.

Step 3: Perform image matching on IpatternR’ and Ibase and record the measurement value.

Step 4: Repeat Steps 2 through 3 until all images in IpatternR have been used.

Step 5: Output the position and rotated angle corresponding to the minimum measurement value.

Fig. 4.10 Flowchart of the proposed image matching method on rotated faces.
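A Python sketch of this search over rotation angles is given below, again reusing the match_pattern helper sketched in Section 4.3.1; the angle range and the use of PIL for rotation are assumptions for illustration only.

```python
import numpy as np
from PIL import Image

def match_rotated(base, pattern, angles=range(-10, 11)):
    """Image matching over a set of candidate rotation angles.

    The pattern is rotated by each candidate angle (in degrees), matched
    against the base image, and the position and angle giving the
    smallest measure are returned.  Note that PIL fills the corners
    exposed by rotation with black, which a more careful implementation
    would mask out before computing the measure.
    """
    best_pos, best_angle, best_measure = None, None, float("inf")
    for angle in angles:
        rotated = np.asarray(
            Image.fromarray(pattern).rotate(angle, resample=Image.BILINEAR))
        pos, measure = match_pattern(base, rotated)
        if measure < best_measure:
            best_pos, best_angle, best_measure = pos, angle, measure
    return best_pos, best_angle
```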

An example of results of image matching on rotated faces is shown in Fig. 4.11.

Fig. 4.11 Example of results of image matching on rotated faces. (a) The rotated angle is 6 degrees clockwise. (b) The rotated angle is –1 degree clockwise.

4.3.4 Learning by Image Matching with Sub-Pixel Precision on Rotated Faces

The techniques mentioned in Sections 4.3.2 and 4.3.3 can be combined to learn the positions and rotated angles on rotated faces with sub-pixel precision. Fig. 4.12 illustrates the combined process, in which the green blocks represent the work of matching on rotated faces, and the blue blocks represent the work of matching with sub-pixel precision.

4.3.5 Correcting Erroneous Matching Results

Although the results of the image matching technique are stable, some errors still exist due to unavoidable changes in lighting. Variations of the shadows on the face also affect the accuracy of the results.


Fig. 4.12 Flowchart of image matching with sub-pixel precision on rotated faces.

For instance, Fig. 4.13(a) shows an example of the results obtained using the method proposed in Section 4.3.4. The corresponding trajectories of the x-coordinate, the y-coordinate, and the rotated angle are shown in Fig. 4.13(b). The red parts of the trajectories represent situations in which the position of the base region suddenly moves forward along a direction and then back. For example, the x-coordinate of the base region in frame 2 is 344.0; it then moves right to 344.5 in frame 3, and then moves left back to 344.0 in frame 4.

This kind of situation is not normal since the face often shakes smoothly.

To solve this problem, a method is proposed to correct the erroneous values. It simply deals with the following situation: the position of the base region suddenly moves forward along a direction and then back within three frames. In this situation, the base region position of the 2nd frame is corrected to the midpoint of the base region positions of the 1st and the 3rd frames.

Fig. 4.13 Example of erroneous matching results. (a) The x-coordinates, y-coordinates, and rotated angles (columns Frame, X, Y, Angle) detected by the image matching techniques. (b) The trajectories of the x-coordinate, the y-coordinate, and the rotated angle.

The entire algorithm of correcting erroneous matching results is described as follows:

Algorithm 5. Correcting erroneous matching results.

Input: a sequence of triples (xi, yi, Ai) = {(x1, y1, A1), (x2, y2, A2), …, (xN, yN, AN)}, where xi, yi, and Ai represent the x-coordinate, the y-coordinate, and the rotated angle of the base region of frame i, respectively.

Output: a sequence of corrected triples (xi, yi, Ai) = {(x1, y1, A1), (x2, y2, A2), …, (xN, yN, AN)}.
Steps:
Step 1: Set current to 2.
Step 2: Set xcurrent to (xcurrent−1 + xcurrent+1)/2, if (xcurrent − xcurrent−1)(xcurrent − xcurrent+1) > 0.
Step 3: Set ycurrent to (ycurrent−1 + ycurrent+1)/2, if (ycurrent − ycurrent−1)(ycurrent − ycurrent+1) > 0.
Step 4: Set Acurrent to (Acurrent−1 + Acurrent+1)/2, if (Acurrent − Acurrent−1)(Acurrent − Acurrent+1) > 0.
Step 5: Add 1 to current.
Step 6: Repeat Steps 2 through 5 until current is larger than N − 1.
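A Python sketch of this correction rule, applied to one trajectory at a time, is given below; the function name is illustrative.

```python
def correct_jitter(values):
    """Correct sudden forward-and-back movements in a trajectory.

    If a value moves in one direction and immediately back in the next
    frame (a one-frame spike), it is replaced by the midpoint of its
    two neighbors.  Corrections are applied sequentially, so later
    comparisons use already corrected values.
    """
    corrected = list(values)
    for i in range(1, len(corrected) - 1):
        prev, cur, nxt = corrected[i - 1], corrected[i], corrected[i + 1]
        if (cur - prev) * (cur - nxt) > 0:   # moved away from both neighbors
            corrected[i] = (prev + nxt) / 2
    return corrected

# The same function is applied independently to the x, y, and angle sequences:
# xs, ys, angles = correct_jitter(xs), correct_jitter(ys), correct_jitter(angles)
```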

Fig. 4.14 shows the result of Fig. 4.13(a) after correcting erroneous matching results. The trajectories shown in Fig. 4.14(b) are smoother and more natural.

Frame    X        Y        Angle
1        344.0    280.0    0
2        344.0    280.5    0
3        344.0    280.5    0
4        344.0    281.0    0
5        344.0    281.5    0

Fig. 4.14 Example of corrected matching results. (a) The values in Fig. 4.13(a) after correction. (b) The corrected trajectories of the x-coordinate, the y-coordinate, and the rotated angle.

4.4 Experimental Results

In this section, some experimental results of the proposed methods described in this chapter are shown. Fig. 4.15 shows a sequence of image frames in a recorded video. The positions and rotated angles of the base regions detected using the method proposed in Section 4.3.4 are listed in Table 4.1(a), and the corrected values obtained using the method proposed in Section 4.3.5 are listed in Table 4.1(b).

Fig. 4.15 A sequence of image frames in a recorded video.

Table 4.1 Positions and rotated angles of base regions of frames in Fig. 4.15. (a) Uncorrected values. (b) Corrected values.

Frame X Y Angle

4.5 Summary and Discussions

In this chapter, the automation of facial feature learning was emphasized. Since the positions of these features can be calculated according to the position of the base region, detecting the base region automatically becomes very important. The image matching technique may be used for this job. The proposed methods modify the technique so that it can be performed on rotated faces with sub-pixel precision. A method is also proposed to correct erroneous matching results.

Chapter 5

Virtual Talking Face Animation Generation

5.1 Introduction

In Chapters 3 and 4, the audio features, the base image sequences, and the facial features are collected using the proposed methods. With the help of these features, virtual talking face animations with synchronized utterances can be generated. Fig. 5.1 shows a block diagram of the proposed animation generation process. “Proper” frames are generated according to the timing information of the input audio and the feature information in the viseme database. Finally, the input audio and the generated frames are combined to produce an animation. It is obvious that generating “proper” frames is a very important task. Badly generated frames may lead to synchronization problems between the audio and the frames, or to unnatural animations.

Fig. 5.1 Block diagram of proposed animation generation process.

Fig. 5.2 illustrates the frame generation process proposed by Lin and Tsai [6].

First, the timing information of the syllables in an input speech is obtained. This timing information is used to decide the number of frames to preserve for every syllable and for the pauses between syllables. The starting frame of every syllable is calculated by accumulating the numbers of frames of the preceding syllables and pauses. Second, the corresponding visemes of the syllables are prepared. Since the number of frames of a syllable in the viseme database may not equal the number of frames to preserve, an algorithm for frame increase and decrease is used. Third, the mouth images of the visemes are pasted onto the base images. Finally, the transition frames between syllables are replaced with several middle frames to produce smoother animations.

Fig. 5.2 Diagram of frame generation process proposed by Lin and Tsai [6].

The above-mentioned process is effective in generating proper frames for synchronized and smooth animations. However, some problems still exist. In the following sections of this chapter, these problems are illustrated, and methods are proposed to solve them in order to produce higher-quality animations.

5.2 Synchronization Between Frames and Speeches

In Lin and Tsai [6], the starting frame of every syllable is calculated by accumulating the numbers of frames of the preceding syllables and pauses. To obtain the number of frames of a syllable or a pause, its length is multiplied by a frames-per-second constant. The constant controls the number of frames appearing within a second. For a standard NTSC video, the constant is set to 29.97.

Since the length of a syllable or a pause is a floating-point number, the calculated number of frames is also a floating-point number. However, the number of frames can only be an integer. Discarding the fractional part leads to a small error for each segment, and these errors accumulate and affect the synchronization between the audio and the frames.
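The size of the accumulated error can be illustrated with a small Python sketch; the durations below are made-up values, not measurements from this study.

```python
FPS = 29.97  # frames per second of a standard NTSC video

# Hypothetical syllable/pause durations in seconds.
durations = [0.21, 0.08, 0.26, 0.05, 0.23, 0.30]

# Naive approach: truncate the frame count of every segment independently.
naive_total = sum(int(d * FPS) for d in durations)

# Reference: convert the total duration to frames in one step.
exact_total = int(sum(durations) * FPS)

print(naive_total, exact_total)  # 30 vs. 33: the truncated counts drift behind
```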

To solve this problem, a method of re-synchronization on every syllable is proposed.
