Generating lifelike facial expressions is a challenging problem in facial animation. Here, we review previous research.
Facial animation approaches can be roughly divided into two groups according to their underlying representation: image-based and 3D model-based. Image-based approaches employ one or several real facial images to synthesize novel images. Because they use real images, image-based approaches can reach photorealistic quality when the data pool contains sufficient samples. Their drawbacks are that the results are difficult to relight and that the view direction is difficult to change. Image-based approaches are usually used for close-up animation.
Among image-based approaches, Beier and Neely proposed a feature-based morphing approach [1]. It was first used in Michael Jackson’s music video “Black or White” with great success, and it is now widely used in image-based animation. In Video Rewrite [5], Bregler et al. used existing footage to synthesize a new video sequence. The system reorders mouth images from the training footage to match the phoneme sequence of a new audio track. This approach can automatically create new video of a person speaking words that he or she never said. Another approach to creating talking heads was proposed by Ezzat et al. in 2002 [7]. Since Bregler’s system cannot produce mouth shapes that are not in its database, Ezzat et al. developed a variant of the multidimensional morphable model (MMM) to synthesize previously unseen mouth configurations from a set of mouth prototypes.
Figure 3: Bregler’s video rewrite system. [5]
The expression ratio image (ERI) [16] is another method for generating expressions with fine details. It captures, from images, the illumination change caused by one person’s expression. After geometric warping, the ERI can be mapped onto another person’s face image. Tu et al. [22] adapted the ERI approach to 3D model-based animation: they transformed ERIs into normal maps, so that the resulting normal map sequence can be applied to 3D models.
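As a rough illustration of the ratio-image idea, the following sketch computes a per-pixel ratio between an expressive and a neutral image of one person and multiplies it into another person’s neutral image. It is a minimal sketch under stated assumptions, not the implementation of [16]: the images are assumed to be grayscale, stored as nested lists of floats in [0, 1], and already aligned by geometric warping. The function names and the `eps` and `max_value` parameters are hypothetical.

```python
def expression_ratio_image(neutral, expressive, eps=1e-6):
    """Per-pixel ratio of an expressive image to a neutral image.

    Both images are assumed to show the same person and to be aligned.
    eps (an assumed guard value) avoids division by zero in dark pixels.
    """
    return [
        [e / max(n, eps) for n, e in zip(n_row, e_row)]
        for n_row, e_row in zip(neutral, expressive)
    ]


def apply_eri(target_neutral, eri, max_value=1.0):
    """Multiply an aligned ERI into another person's neutral image.

    Results are clamped to max_value so that brightened pixels stay in range.
    """
    return [
        [min(t * r, max_value) for t, r in zip(t_row, r_row)]
        for t_row, r_row in zip(target_neutral, eri)
    ]


# Tiny 1x2 example: the first pixel darkens (e.g. a crease), the second is unchanged.
source_neutral = [[0.5, 0.8]]
source_expressive = [[0.25, 0.8]]
target_neutral = [[0.6, 0.4]]

eri = expression_ratio_image(source_neutral, source_expressive)
target_expressive = apply_eri(target_neutral, eri)
```

Because the ratio is multiplicative, the same relative darkening or brightening is transferred regardless of the target’s own skin tone, which is why the warping step must align the two faces first.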
In contrast to image-based facial animation, 3D model-based animation is well suited to 3D environments. However, it requires many control parameters to handle modeling, facial motion, and even surface reflection.
Waters [23] proposed a pioneering physically based face model with muscles. The deformation of facial muscle and tissue is simulated using an approximate model and a physical model, and the face model is controlled by a set of parameters. To generate a desired expression, users adjust the parameter values that control the facial muscles. This approach can simulate muscle and facial tissue, but simulating facial details (such as subtle wrinkles and creases) is inefficient and difficult to control. Sifakis et al. [21] proposed an anatomically accurate face model controlled by muscle activations and kinematic bone degrees of freedom. Their novel algorithm automatically computes control values from sparse motion capture marker input.
While physically based approaches simulate facial expressions with a physical model, other work extracts information from images or synthesizes dynamic textures from images. Making Faces [12] captures image sequences from six camera views of an actress’s face, to which 182 markers are attached. The 3D deformation of the face can then be tracked accurately using this large set of markers. At the same time, the image sequence (after removal and inpainting of the marker regions) is used to create a texture map sequence for the 3D polygonal face model. The approach yields a lifelike performance, but it shares the weakness of texture maps: the result is difficult to relight and retarget.
Based on a large set of 3D scanned face examples, Blanz et al. [3] built a morphable head model. First, they transformed the shape and texture of the examples into a vector space representation. New faces and expressions can then be modeled as linear combinations of the prototypes. This approach is useful for modeling textured 3D faces. In 2003 [2], they further transferred facial expressions by computing the difference between two scans of the same person in a vector space of faces. This approach extracts the difference between a neutral face and an expressive face (e.g., a smiling face). The extracted expression can then be added to images.
Figure 4: Extract an expression from the vector space of faces, and add it to the neutral face. [2]
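The two vector-space operations described above, modeling a face as a linear combination of prototypes and transferring an expression as a difference vector, can be sketched as follows. This is only an illustrative sketch, assuming that each face is already a flat list of floats in dense correspondence with the others; the function names and the `strength` parameter are hypothetical and do not come from [2] or [3].

```python
def linear_combination(prototypes, weights):
    """Model a new face as a weighted sum of prototype face vectors."""
    dim = len(prototypes[0])
    return [
        sum(w * proto[i] for w, proto in zip(weights, prototypes))
        for i in range(dim)
    ]


def transfer_expression(src_neutral, src_expressive, tgt_neutral, strength=1.0):
    """Add the source's expression offset to a different neutral face.

    The offset is the difference between two vectors of the same person;
    strength (an assumed parameter) scales how much of it is applied.
    """
    return [
        t + strength * (e - n)
        for n, e, t in zip(src_neutral, src_expressive, tgt_neutral)
    ]


# Toy 2D "faces" standing in for high-dimensional shape vectors.
average_face = linear_combination([[0.0, 0.0], [2.0, 4.0]], [0.5, 0.5])
smiling_target = transfer_expression([1.0, 1.0], [1.5, 0.5], [2.0, 3.0])
```

The sketch only makes sense when all vectors share the same correspondence, so that index `i` refers to the same facial point in every face.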
Although the range data of a static face can be captured by a laser scanner, the changing expressions of a moving face cannot. Zhang et al. [25] proposed a structured light approach to capture the dynamic variation of a face. Their system uses six cameras and two projectors that project stripe patterns. The depth map is calculated from the deformation of the stripe patterns. In addition, they presented a keyframe interpolation technique to synthesize in-between video frames, as well as a controllable face model.
Figure 5: Structured light image (a) and captured range image (b) from L. Zhang’s system. [25]
A geometry-driven approach proposed by Zhang et al. [27] synthesizes facial expressions from the relation between the positions of specific feature points and expressions. They utilized a vector space based on captured expression prototypes. Given new feature point positions input by users, a new 2D expression can then be synthesized by solving an optimization problem in this vector space. We adapt their approach to extract expression details in normal map form (details are given in later chapters).
The blendshape (shape interpolation) method is a popular approach to computer animation (e.g., the “Gollum” model in the movie “The Lord of the Rings”). However, the production cost of blendshape face animation is considerable. To save production time, Deng et al. [6] proposed a semi-automatic technique that directly animates popular 3D blendshape face models by mapping facial motion capture data spaces to 3D blendshape face spaces. They defined reference mocap-video pairs consisting of captured motions of human subjects and the corresponding video frames. Then, they manually tuned the blendshape weights to match each reference mocap-video pair. Finally, the blendshape weights for any new facial motion capture frame can be derived from the manually tuned weights and the corresponding reference mocap-video pairs. The system thus reduces the time spent manually tuning blendshape weights.
Figure 6: Overview of Deng’s blendshape animation system [6].
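To make the role of the blendshape weights concrete, the sketch below evaluates a blendshape face in the common delta form: the neutral shape plus a weighted sum of offsets toward each target shape. It illustrates shape interpolation in general and is not the mapping technique of [6]; the function name and the flat-list vertex layout are assumptions made for the example.

```python
def blend(neutral, targets, weights):
    """Evaluate a blendshape face.

    neutral: flat list of vertex coordinates for the rest pose.
    targets: list of target shapes with the same layout as neutral.
    weights: one weight per target, typically in [0, 1].
    """
    result = list(neutral)
    for target, weight in zip(targets, weights):
        for i, (t, n) in enumerate(zip(target, neutral)):
            # Each target contributes its offset from the neutral shape.
            result[i] += weight * (t - n)
    return result


# Two toy targets acting on a single vertex (x, y, z).
neutral_face = [0.0, 0.0, 0.0]
smile = [1.0, 0.0, 0.0]
brow_raise = [0.0, 2.0, 0.0]

frame = blend(neutral_face, [smile, brow_raise], [0.5, 0.25])
```

An animation is then a sequence of weight vectors, one per frame, which is why tuning these weights by hand for every frame is the expensive step that a mapping from motion capture data aims to reduce.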
In addition to facial details caused by different expressions, Golovinskiy et al. [11] develop a statistical model for synthesizing detailed facial geometry due to aging. They acquire high-resolution face geometry from people across a wide range of ages, genders, and races. For each scan, they separate the skin surface details from a smooth base mesh using displaced subdivision surfaces. Then, they analyze the resulting displacement maps using the texture analysis/synthesis framework of Heeger and Bergen, adapted to capture statistics that vary spatially across a face. Finally, they use the extracted statistics to synthesize plausible details on face meshes of arbitrary subjects.
Figure 7: Synthesis of facial detail by Golovinskiy et al. [11]. (a) Low-resolution mesh obtained from a commercial scanner. (b) Detail synthesized on (a) using statistics extracted from high-resolution meshes in their database. (c) The face aged by adjusting the statistics to match those of an elderly man.
In addition to facial animation, our research is related to extracting surface variation from images, so we also introduce some work in that area. Horn et al. [13] showed that surface normals can be recovered from the intensity variations of an image, using an optimization method that iteratively minimizes the error. Fang et al. [8] adapted Horn’s approach: they use a simple Lambertian reflection model to extract a normal map from a single image. Their approach takes less time and does not need expensive equipment.
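The link between image intensity and surface normals under the Lambertian reflection model can be illustrated with a small forward-shading sketch. It is a hedged example and not the method of [8] or [13]: it assumes a single distant light source and a constant albedo, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def lambertian_intensity(normal, light_dir, albedo=1.0):
    """Intensity of a Lambertian surface point: albedo * max(0, n . l)."""
    n = normalize(normal)
    l = normalize(light_dir)
    return albedo * max(0.0, sum(a * b for a, b in zip(n, l)))


# A surface facing the light is brightest; tilting it away darkens it.
facing = lambertian_intensity([0.0, 0.0, 1.0], [0.0, 0.0, 1.0])
tilted = lambertian_intensity([0.0, 1.0, 1.0], [0.0, 0.0, 1.0])
```

Recovering the normal from the intensity inverts this relation. A single intensity value constrains only the angle between the normal and the light direction, so additional assumptions or an iterative optimization over neighboring pixels are needed to obtain a full normal map.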