1. INTRODUCTION
sed and improved by Cyberlink, Corp. as a commercial product “Talking show” in 2000 [TALK].
1.4 OVERVIEW AND ORGANIZATION
In this dissertation, a complete procedure is proposed that spans (i) 3D position and motion estimation, (ii) marker tracking in video clips, and (iii) facial animation.
First of all, Chapter 2 presents state-of-the-art research and publications in related areas, namely 3D structure reconstruction, motion tracking, and human face synthesis. The benefits and drawbacks of methods derived from different concepts are also discussed.
The components of the procedure are then proposed in the following chapters. Figure 1.3 is a flow chart of the proposed work, and it also shows the correlations between chapters. The complete procedure and its components are briefly introduced as follows.
Chapter 3 is the core of the proposed work, where we derive an algorithm that estimates 3D position from real versus mirrored-point correspondence.
Since the position extraction of projected markers cannot be entirely exact, measurement noise will degrade the results produced by the algorithm. To evaluate the effects of measurement noise, the expected error of the proposed algorithm is calculated theoretically in Section 3.3.
The facial motion tracking procedures are proposed for two lighting conditions.
Chapter 4 describes how we semi-automatically track markers’ 3D trajectories under a normal light condition. The left tracking flow in Figure 1.3 shows this procedure.
The concept and the equipment setup are specified in Sections 4.1 and 4.2, and block-matching-based feature extraction is presented in Section 4.3. Finally, the chapter presents our approach to estimating facial motion trajectories from the 3D candidate sequences reconstructed by the algorithm proposed in Chapter 3.
Chapter 5 presents a fully automatic procedure for tracking a large quantity of fluorescent markers under the blacklight ultraviolet (UV) lighting condition, and the process corresponds to the right tracking flow in Figure 1.3. Similar to the procedure for the normal light condition, Section 5.1 introduces the concept and Section 5.2 describes the equipment setup and the feature extraction process. Because fluorescent markers luminesce when illuminated by UV light, these special markers are prominent in video. Therefore, fluorescent marker extraction can be achieved more reliably with general computer vision methodology, including conditioning, color labeling, grouping, extracting, etc. With so many markers, false alarms and missing markers become much more serious problems. Section 5.3 proposes how to exploit the spatial and temporal correlations among the markers for fully automatic tracking.
To compare the proposed method with general-purpose stereovision approaches, we conducted computer simulations and actual experiments for different control factors. The experimental results and discussions are presented in Chapter 6.
In Chapter 7, our work on face synthesis is proposed. Section 7.2 proposes a speech-driven facial animation system that animates a face based on a small set of prototype facial motion parameters. The system is further developed into a low-bit-rate talking head to efficiently stream facial animation over the Internet.
Section 7.3 describes how we construct a synthetic face that clones a real one. The way we use estimated 3D FAPs to drive facial animation is presented in Section 7.4.
Finally, Chapter 8 concludes this research and describes future work.
[Figure 1.3 flow chart (Analysis & Synthesis). Components and annotated sections: Web-enabled talking head {Section 7.2}; Speech-driven facial animation {Section 7.2}; Face synthesis {Section 7.3}; Performance-driven facial animation {Section 7.4}; Experiment discussion {Chapter 6}; Expected error analysis {Section 3.3}; Speech analysis; 3D FAPs estimation from UV markers {Section 5.3}; 3D FAPs estimation from normal markers {Section 4.3-4}; 3D position estimation from real versus mirrored point correspondence {Section 3.2}; UV marker extraction in video {Section 5.2}; Normal marker extraction in video {Section 4.3}; Video recording under UV light {Section 5.2}; Video recording under normal light {Section 4.1}.]
Figure 1.3 The flow chart of the proposed work on analysis and synthesis of realistic 3D facial animation. The corresponding chapter or section of each component is also annotated.
Figure 1.4 The 3D facial motion trajectories estimated with the proposed algorithm for realistic facial animation. The red points in the lower part of the figure represent the estimated markers’ 3D positions, and the upper part depicts synthesized facial animation of the pronunciation “O-U”.
Chapter 2
Related work
2.1 Introduction
Since the proposed work comprises techniques for 3D structure reconstruction, facial motion estimation, and face synthesis, this chapter introduces state-of-the-art research and the current status of these three domains.
2.2 3D Structure Reconstruction from multiple views
3D structure reconstruction is an essential process for 3D computer vision, e.g. 3D object modeling, 3D object recognition, and 3D motion tracking.
Multiple view directions are required to reconstruct 3D structure from images.
Most modern stereovision-based 3D structure estimation approaches derive from epipolar constraints. These approaches first use corresponding points in images from different viewpoints to estimate the essential matrix. Then, the rotation R and translation t between cameras are obtained by decomposing the essential matrix. Finally, each point’s 3D position is estimated by intersecting the rays cast from the cameras’ optical centers. [LONG81, ZHAN92, ZHAN95, WENG93, HUAN94] provide good references and discussions on estimating 3D structure or the essential matrix from images.
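The final step described above, intersecting rays cast from two optical centers, can be illustrated with a minimal sketch. Because noisy rays rarely meet exactly, the sketch returns the midpoint of the shortest segment between them, which is one common simple choice and not necessarily the method used in the cited works. It assumes both rays are already expressed in a common coordinate frame (that is, R and t have been applied); the function name and the sample values are hypothetical.

```python
def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _add(a, b):
    return [x + y for x, y in zip(a, b)]

def _scale(a, s):
    return [x * s for x in a]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(o1, d1, o2, d2, eps=1e-12):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2.

    o1, o2: optical centers; d1, d2: ray directions (need not be unit length).
    All vectors are 3-element sequences in one common coordinate frame.
    """
    w = _sub(o1, o2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b          # zero when the rays are parallel
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; no unique closest point")
    s = (b * e - c * d) / denom    # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom    # parameter of the closest point on ray 2
    p1 = _add(o1, _scale(d1, s))
    p2 = _add(o2, _scale(d2, t))
    return _scale(_add(p1, p2), 0.5)

# Hypothetical example: two cameras 4 units apart observing the same point.
point = triangulate_midpoint([0, 0, 0], [1, 2, 5], [4, 0, 0], [-3, 2, 5])
print(point)
```

With noise-free rays the two closest points coincide and the midpoint is the true 3D position; with noisy rays the length of the segment between them gives a rough indication of the reconstruction error.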
Using multiple cameras simultaneously is a common approach to acquiring multiple views. Since each camera can have different parameters, each image will exhibit different distortions. Therefore, before the reconstruction methods mentioned above are performed, captured images have to be undistorted and normalized. Generally, two kinds of camera parameters are of main concern: extrinsic parameters and intrinsic parameters. Extrinsic parameters are the rotation and translation that relate the world coordinate system to the camera coordinate system; intrinsic parameters, also called the camera model, comprise lens distortions, the focal length, the principal point, etc. R.Y. Tsai’s method provides a paradigm for camera calibration [TSAI87]. Recently, J. Heikkilä and O. Silvén’s camera calibration procedure [HEIK97, BOUG] and Z. Zhang’s flexible calibration method [ZHAN00] have been widely used.
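To make the roles of the two parameter groups concrete, the following sketch projects a world point with a simple pinhole model and then maps the pixel back to normalized image coordinates. It is an illustration only: lens distortion is deliberately ignored, a single focal length in pixels is assumed, and the function names and numeric values are hypothetical rather than taken from the cited calibration methods.

```python
def world_to_camera(R, t, X):
    """Apply extrinsic parameters: X_cam = R * X_world + t (R is 3x3, t is 3x1)."""
    return [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]

def project(X_cam, f, cx, cy):
    """Apply intrinsic parameters (focal length f, principal point cx, cy).

    Lens distortion is ignored in this sketch.
    """
    x, y, z = X_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (f * x / z + cx, f * y / z + cy)

def normalize(u, v, f, cx, cy):
    """Map a pixel back to the normalized image plane (focal length 1.0)."""
    return ((u - cx) / f, (v - cy) / f)

# Hypothetical values: identity rotation, zero translation, 640x480 image.
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]
u, v = project(world_to_camera(R, t, [1.0, 2.0, 4.0]), f=800.0, cx=320.0, cy=240.0)
print((u, v), normalize(u, v, 800.0, 320.0, 240.0))
```

After normalization the focal length is effectively 1.0, which is the form of input that reconstruction methods operating on calibrated images expect.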
Besides multiple cameras, placing mirrors in a scene is another way to acquire multiple views. A.R.J. François et al. [FRAN03] proved that 3D reconstruction from a single perspective view of a mirror-symmetric scene is geometrically equivalent to reconstructing the scene with two cameras that are symmetric with respect to the unknown 3D symmetry plane. The advantage of using mirrored views is that only one camera is necessary, and errors caused by imperfect calibration between cameras can be avoided. However, the locations and orientations of the mirrors have to be estimated in this case.
H. Mitsumoto et al. [MITS92] recovered the plane symmetry using the vanishing point. D.Q. Huynh [HUYN99] proposed an affine reconstruction method from a monocular view with a symmetry plane, in which a 3D object is reconstructed by solving for the epipole under a symmetry plane constraint. In our proposed method, 3D positions are reconstructed by estimating the mirror plane from projected corresponding points.
Even though Huynh’s method and ours start from different points of view, we found that Huynh’s solution for the restricted epipole is equivalent to our mirror plane estimation. Huynh’s work focused on solving the problem of 3D reconstruction with a symmetry plane and discussed the advantage of non-linear computation. Our work, on the other hand, not only estimates 3D positions but also takes advantage of the perfect synchronization between multiple mirrored views to track 3D facial motion. Furthermore, we ran computer simulations and derived the theoretical expected errors to show the benefits of 3D position and motion estimation from mirror-reflected video clips compared to two-view algorithms that use multiple cameras. We also discuss the advantages and disadvantages of the proposed method relative to two-view approaches.
In addition to plane mirrors, J. Gluckman and S.K. Nayar [GLUC99, GLUC02] presented stereo sensors using a single camera with various combinations of mirrors, e.g. two spherical mirrors, two convex mirrors, and four planar mirrors.
2.3 Facial Motion Tracking
Different devices and sensors are used for facial motion tracking, depending on the application. T. Goto et al. [GOTO01] proposed a simple procedure to roughly extract the motion of feature points on a bare face from video. FaceStation, developed by Eyematic Interface Inc. [FACE], can also automatically locate and track facial features from a video camera. These kinds of systems can provide user-friendly interfaces for exaggerated and expressive facial animation. Nevertheless, when an application requires accurate 3D facial motion, or the motion of points other than distinct facial features, e.g. points on the cheeks or forehead, conspicuous markers usually have to be attached to the subject’s face.
For 3D facial motion tracking with multiple cameras, an optoelectronic system such as Optotrak [OPTO, HAVE96] uses optoelectronic cameras to track infrared-emitting photodiodes on a subject’s face. Since the root mean square (RMS) error of this system can be as low as 0.1 mm horizontally and vertically and 0.15 mm in depth, such an instrument suffices for research demanding high accuracy, such as facial biomechanics or co-articulation effect analysis. However, each diode needs to be powered by wires, which may interfere with the subject’s facial motion.
Video-based systems that use passive markers avoid this problem. For example, the VICON series [VICO] uses six to 24 specially designed cameras with a resolution of 1280x1024 pixels and frame rates of 60 to 1000 fps to capture markers’ motion in the visible or infrared spectrum. This kind of costly motion capture system is popular in the computer graphics industry for movies and video games. Such systems usually use protruding spherical markers to ease shape analysis, but these markers do not work well for tracking lip surface motion because people sometimes tuck in or otherwise obstruct their lip surfaces. Besides, the extracted motion of a protruding marker is not the exact motion of the face surface but the motion at a small distance above the surface.
In addition to capturing stereo videos with multiple cameras, E.C. Patterson et al. [PATT91] proposed using mirrors to acquire multiple views for facial motion recording. They simplified the 3D reconstruction problem by assuming a plumb camera and vertical mirrors. S. Basu et al. [BASU97, BASU98] employed a front view and a mirrored view to capture 3D lip motion. In their work, they regarded the mirrored view as a flipped image from a virtual camera and applied a general-purpose stereovision approach to estimate 3D lip motion. We also use mirrors to acquire additional images with different view directions. However, our algorithm is simpler yet more accurate because it exploits the symmetric properties of mirrored objects.
Some devices and studies rely on other concepts to estimate 3D motion or structure. For example, the ShapeSnatcher system [SHAP, KALB01] projects grids onto a face and can therefore extract 3D shape and texture from a single image.
2.4 Human face synthesis
In general, the framework of a synthetic face, the control of facial expression, and the events that drive facial animation are the three principal considerations in a facial animation system.
Research on the framework of a synthetic face can be roughly classified into three categories: 2D-mesh-based, 3D-polygon-based, and image-sample-based. The 2D-mesh-based approach is the most easily controlled and computationally efficient design. Only a single image texture and a face mesh are required to construct a synthetic face [PERN98]. The main disadvantage is that the view directions of a 2D face are limited and it is difficult to integrate into a 3D graphical environment. Most studies adopt the 3D-polygon-based approach to avoid these problems, but modeling and controlling a 3D face is much more delicate. Laser scanners such as those produced by Cyberware Corp. [CYBE] can acquire a precise 3D face shape with texture mapping. W. Lee et al. [LEE99] applied a semi-automatic approach that builds a 3D face model from orthogonal view images of a person. In the work of F. Pighin et al. [PIGH98], photographs taken from different view directions were integrated to construct a delicate face model with view-dependent texture mapping. V. Blanz et al. [BLAN99] established an excellent system that builds a personalized 3D head model from only a single face image by using statistical information about human heads.
Image-sample-based systems synthesize faces and facial expressions by morphing between several photographic images. The morphing technique proposed by T. Beier and S. Neely [BEIE92] produced the impressive transitions between different faces in Michael Jackson’s music video “Black or White”.
“Video Rewrite”, proposed by C. Bregler et al. [BREG97], synthesized video-realistic facial animation by combining image samples of a face and mouth according to input phonemes. E. Cosatto et al. [COSA00] further decomposed the samples into smaller facial parts and formed a sample space, which made the synthesis process more flexible and efficient. To date, the image-sample-based approach may be the most realistic of all the approaches, but it suffers from the same disadvantage as the 2D-mesh-based approach: the view directions are limited. This problem can be solved by the view morphing technique [SEIT96]. Nevertheless, a large database of image samples or heavy computation is indispensable. In addition, it is difficult to apply the sample data to other people’s faces.
For facial expression synthesis, a muscle-based approach imitates the anatomy of human faces and controls expressions by adjusting interior muscles. K. Waters [WATT87] developed a dynamic face model with linear muscles and sphincter muscles. In D. Terzopoulos and K. Waters’ research [TERZ90], a face tissue model with a three-layer structure was proposed to simulate skin, subcutaneous tissue, and muscles. The muscle-based approach conforms to the facial action coding system (FACS) [EKMA78] and is suitable for modeling exaggerated expressions.
However, in F.I. Parke and K. Waters’ words [PARK96], “FACS seems complete for reliably distinguishing actions of the brows, forehead, and eyelids. FACS does not include all of the visible, reliably distinguishable actions of the lower part of the face”. Human faces and lips are so subtle that an approximate model can still hardly simulate many fine variations.
Since the exterior face shape is the main concern in computer animation, the feature-point-driven approach simulates facial expression by directly controlling feature points on a synthetic face surface. This kind of approach models the exterior face shape as several parametric surfaces, such as bicubic surfaces [REEV90] or radial-basis functions [NIEL93, PIGH98]. The advantage of feature-point-driven facial expression is that expressions can be generated intuitively from motion capture data or manual adjustment. On the other hand, it requires many control points to synthesize subtle facial motions.
As mentioned in Chapter 1, performance-driven, text-driven, and speech-driven facial animation are the three major approaches to driving synthetic faces. The performance-driven approach synthesizes facial animation directly from captured motion data. Guenter et al. [GUEN98] produced remarkably lifelike facial animation from abundant motion capture information. They recorded the positions of 182 dot markers and the facial textures on a subject’s face at 30 frames per second. A text-driven talking head translates each input word into visemes and performs animation by interpolating the visemes according to input time stamps. To produce the voice of a text-driven talking head, a text-to-speech (TTS) module synthesizes auditory speech synchronously according to the time stamps. A speech-driven face system is quite similar to the text-driven one but derives visemes from input natural speech instead. The benefit of a speech-driven face system is that it uses natural speech as the output voice and does not suffer from the unnatural synthetic voice of the text-driven one. However, the faithfulness of the facial animation depends on how accurately the input speech is recognized. A common problem in text-driven and speech-driven facial animation is the motion transition function between visemes.
M.M. Cohen and D.W. Massaro [COHE93] introduced several hypotheses. The voice puppetry proposed by Brand [BRAN99] further applied the Hidden Markov Model (HMM) to approximate facial motions driven by various audio features. T. Ezzat et al. [EZZA02] proposed a multidimensional morphable model (MMM) to synthesize novel facial motion trajectories based on a small set of prototypes.
There is other related research on synthetic faces. B. Tiddeman et al. [TIDD01] presented a wavelet-based method for prototyping facial textures and transforming the age of facial images. Z. Liu et al. [LIU01] proposed synthesizing delicate details on a face with expression ratio images (ERI). Noh and Neumann [NOH01] presented a method to retarget facial motions while preserving the relative features of the original facial animation.
Chapter 3
3D Position Estimation from Mirror-reflected Multi-view Video
3.1 The Problem Statement
As shown in the conceptual diagram in Figure 3.1, a mirrored image can be regarded as a “flipped” image taken by a “virtual camera” whose view direction differs from that of the real one. For a detailed proof of the relation between real and virtual cameras, we refer to [FRAN03]. With two mirrors next to a subject’s face, we can simultaneously acquire three facial images from different viewpoints and also avoid the problem of synchronizing data among different cameras.
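The virtual-camera idea can be sketched numerically, assuming for illustration that the mirror plane is already known and written as u·x = d with normal u. Reflecting a point across the plane moves it along the normal, m′ = m + k u with k = −2(u·m − d) for a unit normal, and reflecting the real camera's optical center gives the virtual camera's center. The function name and the sample plane are hypothetical; estimating the plane itself is a separate problem that this sketch does not address.

```python
import math

def reflect_point(m, u, d):
    """Reflect 3D point m across the plane u . x = d.

    u need not be unit length; it is normalized here (and d is rescaled
    accordingly) so that k below is a signed distance along the unit normal.
    """
    norm = math.sqrt(sum(c * c for c in u))
    if norm == 0.0:
        raise ValueError("plane normal must be non-zero")
    n = [c / norm for c in u]
    d_n = d / norm
    # Signed distance of m from the plane is (n . m - d_n); move twice that
    # distance back through the plane: m' = m + k * n.
    k = -2.0 * (sum(ni * mi for ni, mi in zip(n, m)) - d_n)
    return [mi + k * ni for mi, ni in zip(m, n)]

# Hypothetical mirror: the plane z = 5, written with a normal whose z
# component is negative, i.e. (0, 0, -1) . x = -5.
u, d = [0.0, 0.0, -1.0], -5.0
marker = [1.0, 2.0, 3.0]
virtual_marker = reflect_point(marker, u, d)           # mirrored marker
virtual_camera = reflect_point([0.0, 0.0, 0.0], u, d)  # mirrored optical center
print(virtual_marker, virtual_camera)
```

Seen this way, viewing the mirrored marker from the real camera is geometrically the same as viewing the real marker from the virtual camera, which is why a single camera with mirrors yields several synchronized views.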
Before we use these images for 3D position estimation, the orientations and locations of the mirror planes must be estimated. Some research [ZHAN98] required explicitly measuring these properties. Explicit measurement is neither user-friendly nor reliable unless precise devices are employed. Hence, accurate methods that handle the whole process directly from image sequences are necessary.
In some related research [BASU97, BASU98], 3D positions in the aforementioned situation were estimated by modified general-purpose stereovision approaches, which estimate the affine transformation (rotation R, translation T) between two cameras from the essential matrix [LONG81, WENG89, WENG93, HAR95]. After the rotation and translation between the two cameras are evaluated, the 3D position of a target can be approximated from the intersection of the rays cast from the optical centers of the different cameras.
However, mirrored images have some nice properties that can be used to derive a more accurate 3D reconstruction algorithm for mirror-reflected multi-view images. We present our approach in the following section.
Figure 3.1 The conceptual diagram of “virtual cameras”. Properties of a virtual camera, including intrinsic and extrinsic parameters, are symmetric to those of an actual camera with respect to a mirror plane.
3.2 The Proposed Closed-form Linear Algorithm
In this section, we introduce our algorithm for 3D position estimation for the case of one mirror, which can be easily extended to the case of two mirrors. We assume that input images have been normalized by camera calibration processes [HEIK97, BOUG, ZHAN00], and we also assume that the projected positions and correspondences of real and mirrored markers have been extracted. The details of marker extraction, tracking, and recovering point correspondences under normal light and blacklight conditions are presented in Chapters 4 and 5.
With the point correspondences, we can now calculate markers’ 3D positions by first evaluating the mirrors’ orientations and locations in the camera coordinate system and then estimating the markers’ 3D positions as a minimization problem.
In the first step, we assume plane mirrors and use only the image data within the mirrors’ range. A mirror’s location and orientation can be represented using a plane equation:
$ax + by + cz = d$ (3.1)
$\mathbf{u} = (a, b, c)^t$, $\|\mathbf{u}\| = 1$, where $\mathbf{u}$ is the plane’s unit normal and vector $\mathbf{u}$ has two possible directions. Without loss of generality, we take the direction of $c < 0$. In the following discussion, we assume that I is the camera film’s image plane and f is the focal length. (If images are undistorted and normalized according to the normalized camera model, f is 1.0.) We assume the camera’s lens center O to be the origin of the coordinate system, and the camera’s line of vision, also called the optic axis, is the positive z axis.
Figure 3.2 The geometric representation of a physical point m, the reflected virtual point m′, and the projection points p, p′.
In Figure 3.2, $\mathbf{m}_i$ is the actual 3D position of marker i, $\mathbf{m}_i = (x_{m_i}, y_{m_i}, z_{m_i})^t$, and $\mathbf{m}'_i$ is the 3D position of virtual marker i in the mirrored space, $\mathbf{m}'_i = (x_{m'_i}, y_{m'_i}, z_{m'_i})^t$, where [...] the estimated markers’ 2D positions.
Mirror properties dictate that
$\mathbf{m}'_i = \mathbf{m}_i + k\mathbf{u}$
[...] coplanar, and thus
[...] (3.2)
and simplify it as