
4.4 FAP Extraction

In the previous step, the markers’ 3D motion trajectories were estimated. However, a subject may swing or nod his or her head while speaking or making facial expressions, so the motion of the 3D markers is composed of both facial motion and head motion. To acquire precise facial motion, the head motion must be estimated and removed from the 3D motion trajectories.

As mentioned in the review by T. S. Huang and A. Netravali [HUAN94], given three non-collinear 3D points, the movement of a rigid object is uniquely determined by a rotation matrix $R_{head}$ and a translation vector $T_{head}$:

$$ ms_s(t) = R_{head}(t)\, ms_s(1) + T_{head}(t) \qquad (4.7) $$

where $ms_s(t)$ is the 3D position of a specific point $s$ on a rigid object at time $t$, and $ms_s(1)$ is the 3D position of point $s$ at the initial time.

Therefore, the 3D data of more than four additional markers placed on the performer’s ears are regarded as points on a rigid body, and we apply the rigid-body motion estimation algorithm proposed by K. Arun et al. [ARUN87] to determine the head rotation $R_{head}(t)$ and head translation $T_{head}(t)$ for each video clip. After the rotation and translation at successive time stamps are determined, the 3D facial motion of marker $i$ at time $t$ without head movement can be extracted as

$$ fm_i(t) = R_{head}^{-1}(t)\,\bigl( m_i(t) - T_{head}(t) \bigr) \qquad (4.8) $$

where $m_i(t)$ is the original estimated 3D position of marker $i$ at time $t$.
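As a minimal sketch of Equation (4.8), the following C function removes an estimated head pose from one marker position. The type names `Vec3` and `Mat3` and the function name `remove_head_motion` are illustrative assumptions, not part of the original system. The sketch also assumes that the head rotation is a proper rotation matrix, so that its inverse equals its transpose.

```c
/* Hypothetical types for illustration; not taken from the dissertation. */
typedef struct { double x, y, z; } Vec3;
typedef struct { double m[3][3]; } Mat3;   /* row-major 3x3 rotation matrix */

/* Equation (4.8): fm = R_head^{-1} * (m - T_head).
 * Assumes R_head is orthonormal, so R_head^{-1} is its transpose. */
Vec3 remove_head_motion(const Mat3 *R_head, Vec3 T_head, Vec3 m)
{
    /* Subtract the head translation first. */
    Vec3 d = { m.x - T_head.x, m.y - T_head.y, m.z - T_head.z };
    Vec3 fm;

    /* Multiply by the transpose: column k of R_head acts as row k. */
    fm.x = R_head->m[0][0] * d.x + R_head->m[1][0] * d.y + R_head->m[2][0] * d.z;
    fm.y = R_head->m[0][1] * d.x + R_head->m[1][1] * d.y + R_head->m[2][1] * d.z;
    fm.z = R_head->m[0][2] * d.x + R_head->m[1][2] * d.y + R_head->m[2][2] * d.z;
    return fm;
}
```

Applying the function to a point that was moved by Equation (4.7) should return the point's initial position, which gives a simple consistency check for an estimated head pose.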

Chapter 5 Fully Automatic Mass 3D Marker Tracking Under Blacklight-UV Lamps

5.1 Introduction

In Chapter 4, a semi-automatic 3D motion tracking procedure was proposed. User intervention is necessary in that system for two reasons. The first is the difficulty of marker identification. Under normal lighting, reliable extraction of markers is not easy, since the markers’ colors and projected shapes can change dramatically with the reflection angle, and some facial parts may occasionally be misjudged as markers for the same reason. To avoid extracting markers, we applied block matching for tracking, which compares the color variation between previous and successive video clips. However, we still have to identify where the markers are in the first frame, and manual selection is required. The second reason is ambiguity and occlusion in tracking. With block matching, perturbation of the markers’ reflected colors and projected shapes can make the tracking trajectories “tremble”. Occasionally, it may even “derail” the tracking to a location where there is no marker. We utilize adaptive Kalman filters to alleviate these situations. Nevertheless, occlusion remains the most critical problem preventing the tracking method in Chapter 4 from being fully automatic. For example, when the mouth is pouted or opened wide, the markers below the lower lip vanish from the video clips. We also tried thresholding in block matching and Kalman predictors to tackle this problem, but this works satisfactorily only for short-term marker occlusion.

To achieve fully automatic tracking, some researchers employed a generic facial motion model. T. Goto et al. [GOTO01] utilized separate simple tracking rules for the eyes, lips, etc. F. Pighin et al. [PIGH99] proposed tracking facial motion for animation based on a linear combination of 3D face model bases. In “voice puppetry” [BRAN99], M. Brand applied a generic head mesh with 26 feature points, where spring tensions are assigned to each edge connection.

Such a generic facial motion model can rectify “derailing” tracking trajectories and is beneficial for sparse feature tracking. However, an approximate model can also restrict the feature tracking when a subject makes pronounced or unusual facial expressions.

From another perspective, applying special lights to highlight markers is an effective way to improve feature extraction. As mentioned in Chapter 2, active markers, which emit infrared rays, and passive markers, which have a high infrared response, are both widely used in industrial motion capture products. E.C. Patterson et al. presented a facial tracking system with a dozen passive markers for ultraviolet (UV) light. B. Guenter et al. [GUEN98] tracked 182 dot markers painted with fluorescent pigments under near-UV light. Their work not only used special markers and lights to enhance feature detection, but also took spatial and temporal consistency into account for reliable tracking.

The work of B. Guenter et al. [GUEN98] inspired the tracking method proposed in this chapter. We also apply markers with special pigments for blacklight blue (BLB) fluorescent lights to considerably increase the distinctness of the markers in color space, and we also utilize spatial and temporal coherence to detect and compensate for missing and false detections in tracking. However, the proposed method is more efficient and versatile. It is able to capture more than 300 markers, and it will be extended to track more than 100 markers from live video in real time on a regular PC. In the work of Guenter et al. [GUEN98], the subject’s head was required to be immobile because of the limitation on the markers’ vertical order in their marker matching routine, and therefore the head movement had to be tracked independently as a post-process. By contrast, the proposed method is capable of fully automatic tracking of facial expressions and head motion simultaneously.

In Section 5.2, we present how to detect markers with fluorescent pigments; the tracking procedure that can be fully automatic is proposed in Section 5.3.

5.2 Equipment Setting and Feature Extraction

To enhance the distinctness of the markers in video clips, we apply UV-responsive markers and UV blacklight blue (BLB) lamps. These devices and UV light are briefly introduced here.

Ultraviolet (UV) light is the section of the light spectrum extending from the blue end of the visible range (400 nm) to the X-ray region (100 nm). It can excite some materials, such as phosphors, into luminescence. Luminescence comprises fluorescence and phosphorescence. The main difference between the two is the duration of radiation: when the UV radiation stops, fluorescence vanishes, whereas phosphorescence continues for a while. UV light is further divided into three bands: UV-A, UV-B, and UV-C. UV-A has the longest wavelength (400 nm to 315 nm) and the lowest energy of the three bands and is also referred to as “black light”. Since black light is less harmful than the most aggressive component, UV-B, it is commonly used to detect counterfeit money in banks and for special effects in nightclubs and theaters.

In the proposed tracking system, we also utilize fluorescence to emphasize markers in video clips. Markers are covered with fluorescent pigments, and blacklight blue lamps are used to excite their fluorescence. The fluorescence lies in the visible light spectrum, so no special attachment lenses are needed for light filtering. Figure 5.1 is a photo of the proposed tracking equipment taken under normal light, and Figure 5.2 is a video clip captured by a digital video camera, in which the fluorescence of the markers is excited by UV light. For a further introduction to UV light and luminescence, please refer to [ILLU85].

Figure 5.3 shows an example and the flow diagram of our marker extraction method. As shown in the original video clip (Figure 5.3(a)), owing to the use of UV-responsive markers and blacklight blue lamps, the markers are prominent compared with other regions of the video clip; therefore, automatic feature extraction is more reliable and more feasible than under normal light.

We mainly follow the methodology of connected component analysis in computer vision, which is composed of thresholding, connected component labeling, and region property measurement, but we slightly modify the implementation for computational efficiency. With this modification, our system can be extended to real-time tracking of live video on a regular PC.

Since the intensity of UV-responsive markers is much higher than that of other regions, the first stage is color thresholding, which excludes pixels that are unlikely to belong to a marker projection. Thresholding distinguishes pixels with higher R, G, and B values from pixels with lower values. Figure 5.3(c) is a color-thresholded image, in which pixels that pass the threshold are displayed in white. The threshold is determined empirically.
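The thresholding stage can be sketched in C as follows. This is a minimal illustration under stated assumptions: the function name `threshold_rgb`, the interleaved 8-bit RGB layout, and the rule that a pixel must exceed the threshold in all three channels are assumptions, since the text does not specify how the per-channel comparisons are combined.

```c
#include <stddef.h>
#include <stdint.h>

/* Marks pixels whose R, G, and B values all exceed the given thresholds.
 * rgb:  interleaved 8-bit R,G,B values, 3 bytes per pixel (assumed layout).
 * mask: output, 1 for a passing pixel and 0 otherwise.
 * Returns the number of passing pixels.
 * Combining the three channel tests with AND is an assumption. */
size_t threshold_rgb(const uint8_t *rgb, size_t n_pixels,
                     uint8_t tr, uint8_t tg, uint8_t tb, uint8_t *mask)
{
    size_t count = 0;
    for (size_t k = 0; k < n_pixels; k++) {
        const uint8_t *p = rgb + 3 * k;
        mask[k] = (uint8_t)(p[0] > tr && p[1] > tg && p[2] > tb);
        count += mask[k];
    }
    return count;
}
```

With the threshold RGB (65, 76, 92) listed in Figure 5.3, a bright pixel such as (200, 200, 200) passes, while (70, 80, 90) is rejected because its blue value is below 92.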

In many feature extraction systems, mathematical morphology operations, such as dilation, erosion, opening, and closing, are performed before or after color thresholding, but not in our system. This is because thresholding works satisfactorily in most cases, and the most difficult case, interlaced scan lines as shown in Figure 5.3(g), can be handled more efficiently by merging nearby connected components.

The second stage is color labeling. In our experiment, we collected six UV-responsive markers that are pink, yellow, green, white, blue, and purple when illuminated by normal fluorescent lamps, but only four typical colors (pink, blue-green, dark blue, and purple) appear when they are illuminated by blacklight blue lamps (as listed in Figure 5.3(d)). Hence, four color classes are adopted, and each color class comprises dozens of color samples. A selection tool is provided to select these color samples from training videos. To classify the color of a pixel in a video clip, the nearest-neighbor method (1-NN) is applied. To diminish classification errors resulting from intensity variation, the matching operation works in a normalized color space (nR, nG, nB), where

$$ nR = \frac{R}{R+G+B}, \qquad nG = \frac{G}{R+G+B}, \qquad nB = \frac{B}{R+G+B} $$

and (R, G, B) is the original color value. In general, the more color samples a color class contains, the more accurately the color class of a pixel is determined. For real-time or near real-time applications, four color samples in each color class are sufficient.

Figure 5.3(e) is the color-labeled image represented by typical colors in color classes.
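The color labeling stage can be illustrated with the following C sketch of 1-NN classification in the normalized color space. The struct `ColorSample`, the function names, the use of squared Euclidean distance, and the handling of black pixels are assumptions made for illustration; the text only states that the nearest normalized color sample is used.

```c
#include <stddef.h>

/* One training sample in normalized color space (hypothetical type). */
typedef struct { double nr, ng, nb; int cls; } ColorSample;

/* Normalizes (R,G,B) by the channel sum. Mapping a black pixel to
 * (0,0,0) is an assumption to avoid division by zero. */
static void normalize_rgb(double r, double g, double b, double out[3])
{
    double s = r + g + b;
    if (s <= 0.0) { out[0] = out[1] = out[2] = 0.0; return; }
    out[0] = r / s;
    out[1] = g / s;
    out[2] = b / s;
}

/* Returns the class of the nearest sample, or -1 if there are no samples.
 * Squared Euclidean distance is an assumed metric. */
int classify_color_1nn(double r, double g, double b,
                       const ColorSample *samples, size_t n)
{
    double c[3];
    int best_cls = -1;
    double best_d = 0.0;

    normalize_rgb(r, g, b, c);
    for (size_t k = 0; k < n; k++) {
        double dr = c[0] - samples[k].nr;
        double dg = c[1] - samples[k].ng;
        double db = c[2] - samples[k].nb;
        double d = dr * dr + dg * dg + db * db;
        if (k == 0 || d < best_d) { best_d = d; best_cls = samples[k].cls; }
    }
    return best_cls;
}
```

Because of the normalization, a pixel and a dimmer copy of it, for example (120, 40, 40) and (60, 20, 20), map to the same point and therefore receive the same class.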

Connected component labeling is the third stage of our feature extraction. It groups connected pixels with the same color label into a component, and we adopt 8-connected neighbors. Several connected component algorithms have been proposed, for example, iterative algorithms, the classical algorithm, and space-efficient two-pass algorithms [HARA92]. These algorithms are general-purpose and take all circumstances of connection into account. In our case, however, a marker’s projection is smaller than a radius of 5 pixels, so the process of connected component labeling can be much simplified. We modify the classical algorithm as in the following C-like pseudo-code:

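void PreliminaryCCL() {
    // initialization: one label counter per color class
    for (c = 0; c < color_class; c++) {
        newLabel[c] = 0;
    } // for c

    // labeling: raster scan, row by row
    for (i = 0; i < I_height; i++) {
        for (j = 0; j < I_width; j++) {
            cl = ColorLabel[i][j];
            if (IsValidColorGroup(cl)) {
                // labels of already-visited neighbors of the same color class
                A = PrecedingNeighbors(i, j, cl);
                if (IsEmpty(A)) {
                    // no labeled neighbor: start a new component
                    CCL[i][j] = newLabel[cl];
                    newLabel[cl]++;
                } else {
                    // otherwise take the smallest neighboring label
                    CCL[i][j] = MIN(A);
                } // else
            } // if
        } // for j
    } // for i
} // void PreliminaryCCL()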

In the pseudo-code, the result of the preliminary CCL is placed in the array CCL, and the function PrecedingNeighbors(i, j, cl) collects, for the pixel at (i, j), the valid CCL values of the already-visited neighbors, i.e., those at (i-1, j-1), (i-1, j), (i-1, j+1), and (i, j-1) in the (row, column) indexing of the pseudo-code. Unlike the classical algorithm, PreliminaryCCL checks only preceding neighbors. Not all 8-connected components are labeled as the same group by PreliminaryCCL, since we do not use a large equivalence class table for propagating label numbers as the classical algorithm does. However, the inconsistency is local and can easily be resolved in the next stage. A result image processed by PreliminaryCCL is shown in Figure 5.3(f).
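For readers who want to experiment, the pseudo-code can be turned into a compilable C function along the following lines. This is a sketch under assumptions: the flat row-major arrays, the use of -1 for rejected pixels, and the name `preliminary_ccl` are illustrative choices, and the four preceding neighbors are taken in (row, column) order for a row-by-row scan.

```c
/* color:    h*w class indices (0..n_classes-1), or -1 for rejected pixels.
 * ccl:      h*w output labels, numbered per color class; -1 if unlabeled.
 * n_labels: output, number of labels issued for each color class.
 * Only already-visited neighbors are checked and no equivalence table is
 * kept, so one marker may end up with more than one label. */
void preliminary_ccl(const int *color, int h, int w, int n_classes,
                     int *ccl, int *n_labels)
{
    /* Offsets of the four preceding 8-connected neighbors. */
    static const int di[4] = { -1, -1, -1, 0 };
    static const int dj[4] = { -1, 0, 1, -1 };

    for (int c = 0; c < n_classes; c++) n_labels[c] = 0;

    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            int cl = color[i * w + j];
            int min_label = -1;

            ccl[i * w + j] = -1;
            if (cl < 0 || cl >= n_classes) continue;

            for (int k = 0; k < 4; k++) {
                int ni = i + di[k], nj = j + dj[k];
                if (ni < 0 || nj < 0 || nj >= w) continue;
                if (color[ni * w + nj] != cl) continue;
                int lab = ccl[ni * w + nj];
                if (min_label < 0 || lab < min_label) min_label = lab;
            }
            /* No labeled neighbor of this class: issue a new label. */
            if (min_label < 0) min_label = n_labels[cl]++;
            ccl[i * w + j] = min_label;
        }
    }
}
```

On a U-shaped region the two arms receive different labels even though they are connected through the bottom row, which is the local inconsistency that the refinement stage is meant to resolve.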

After preliminary connected component labeling, there are still redundant connected components caused by the interlaced fields of the video, incomplete connected component labeling, or noise (as shown in Figure 5.3(g)). The fourth stage refines the connected components so that each extracted component is as close as possible to the actual marker’s projection. Since markers are placed evenly on the face and the shortest distance between two markers of the same color class is greater than the diameter of a dot marker, nearby connected components should belong to the same marker. Therefore, the first two kinds of redundant connected components can be handled simply by merging components that are closer than the markers’ average diameter. Redundant components caused by noise are suppressed by removing connected components of fewer than four pixels. Figure 5.3(h) is the refinement of Figure 5.3(g); Figure 5.3(i) shows the extracted markers’ projections.
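The refinement stage can be sketched in C as a merge pass followed by a size filter. The `Component` summary struct, the function name `refine_components`, the centroid-based distance test, and the order of operations (merging before removal) are assumptions for illustration; the text specifies only the two criteria, a merge distance below the average marker diameter and a minimum size of four pixels.

```c
#include <stddef.h>

/* Summary of one connected component (hypothetical type). */
typedef struct {
    double cx, cy;   /* centroid in pixels */
    int    pixels;   /* pixel count; 0 marks a component merged away */
    int    cls;      /* color class */
} Component;

/* Merges same-class components whose centroids are closer than merge_dist,
 * then drops components with fewer than min_pixels pixels (min_pixels >= 1).
 * Compacts comps in place and returns the number of components kept. */
size_t refine_components(Component *comps, size_t n,
                         double merge_dist, int min_pixels)
{
    double d2 = merge_dist * merge_dist;
    int merged = 1;

    while (merged) {                      /* repeat until no pair merges */
        merged = 0;
        for (size_t a = 0; a < n; a++) {
            if (comps[a].pixels == 0) continue;
            for (size_t b = a + 1; b < n; b++) {
                if (comps[b].pixels == 0 || comps[b].cls != comps[a].cls) continue;
                double dx = comps[a].cx - comps[b].cx;
                double dy = comps[a].cy - comps[b].cy;
                if (dx * dx + dy * dy >= d2) continue;
                /* Pixel-weighted centroid of the merged component. */
                int total = comps[a].pixels + comps[b].pixels;
                comps[a].cx = (comps[a].cx * comps[a].pixels +
                               comps[b].cx * comps[b].pixels) / total;
                comps[a].cy = (comps[a].cy * comps[a].pixels +
                               comps[b].cy * comps[b].pixels) / total;
                comps[a].pixels = total;
                comps[b].pixels = 0;
                merged = 1;
            }
        }
    }

    size_t kept = 0;
    for (size_t a = 0; a < n; a++)
        if (comps[a].pixels >= min_pixels) comps[kept++] = comps[a];
    return kept;
}
```

Merging before filtering lets two small fragments of one marker, such as the halves produced by interlaced fields, survive the size filter as a single component.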

The extracted connected components are still not equivalent to the actual markers’ projections. The responsive colors of a fluorescent marker can still vary with the viewing direction. This may result in erroneous color classification and cause missing or redundant extraction of projected markers. In addition, the position of an extracted marker is also disturbed by noise.

To compensate for the imperfect feature extraction, the following section proposes a procedure to automatically track the 3D motion of mass markers in the presence of missing and false detections in feature extraction.

Figure 5.1 The tracking equipment for the blacklight condition. The photo was taken under normal light. Two “Blacklight Blue” (BLB) lamps are placed in front of a subject and mirrors. These low-cost special lamps are coated with fluorescent powders and emit long-wave UV-A radiation to excite luminescence.

Figure 5.2 A captured video clip of fluorescent markers illuminated only by UV “Blacklight Blue” lamps. The fluorescence lies in the visible light spectrum, and no special lens is required for filtering. (300 markers are evenly pasted on the subject’s face.)

[Figure 5.3 panels: feature extraction of UV-responsive markers]
(a) An original video clip composed of a front and two side views.
(b) The RGB color histogram.
(c) Valid pixels that passed the color threshold (color thresholding at RGB (65, 76, 92)).
(d) n classes of color samples (n = 4 in this case).
(e) The color-labeled image represented by the typical colors of the color classes (color labeling by the nearest normalized color sample).
(f) The preliminary connected component labeled image, obtained by a simplified connected component labeling method; a red cross represents the center of a connected component.
(g) Redundant connected components caused by interlaced effects, incomplete connected component labeling, or noise.
(h) Refinement of connected components: merging nearby components of the same color class and removing those with fewer than 4 pixels.
(i) Extracted projected positions of UV-responsive markers; a red cross represents the center of an extracted marker’s projection.

Figure 5.3 An example and the flow diagram of feature extraction of UV-responsive markers. The process proceeds from color thresholding and color labeling through connected component grouping to refinement.

5.3 A Fully-Automatic Tracking Procedure for Mass 3D Markers

Fully automatic tracking of multiple target trajectories over time is an important problem in radar surveillance systems, where it is called the “multitarget tracking problem” [BUCK00]. With only measurement errors and false detections, this problem is equivalent to the minimum cost network flow (MCNF) problem. The optimal solution is feasible, with a computational complexity of O(N³logNC), where N is the number of nodes in the network and C is the maximum value of the coefficients among edges [CAST90, WOLF89]. However, when measurement errors, missing detections, and false alarms all occur in tracking, time-consuming methods such as dynamic programming are required to estimate approximate trajectories, and the tracking results can degrade seriously even when the number of missing detections increases only slightly [BUCK00]. As mentioned in the previous section, even though fluorescent markers and blacklight lamps are used to enhance the distinctness of the markers and to improve the steadiness of their projected colors, missing and false detections are still unavoidable in the feature extraction process.

Fortunately, the motion of markers on a facial surface is unlike that of targets tracked in radar systems. Targets in the general multitarget tracking problem move independently, and consequently only the prior trajectory of a target can be used to infer the target’s movement from the detected candidates over time. By contrast, points on a face surface offer not only earlier information but also spatial coherence within the current time stamp. Except around the mouth, nostrils, and eyelids, most parts of a face are continuous surfaces, and the position and motion of a facial point are similar to those of its neighbors. With this additional property, automatic diagnosis of missing and false detections becomes feasible, and the computation is more efficient.

Figure 5.4 is the flow chart of the proposed tracking procedure for the UV light condition. Here, we first present issues encountered in 3D motion tracking of mass UV-responsive markers and our proposed solutions.

Equipment setting

In the first step, equipment setting, two mirrors and two UV blacklight blue lamps are placed in front of a video camera, as shown in Figure 5.1. As mentioned in previous chapters, the camera’s intrinsic parameters are first estimated by camera calibration methods [HEIK97, ZHAN00], for which we adopt a well-organized camera calibration library developed by J.-Y. Bouguet [BOUG]. All operations in the following steps are performed in normalized camera coordinates based on the estimated camera parameters. The orientations and locations of the mirror planes are then estimated by the method introduced in Chapter 3. All the equipment should be fixed stably to avoid re-calibration of device parameters.

Recovering point correspondence in the neutral face

The tracking procedure is initialized by reconstructing the 3D positions of the markers in the first frame. In Chapter 4, since the markers in video clips are not distinct enough, user interaction is required to explicitly specify all markers’ projected positions and point correspondences. By comparison, UV-responsive markers are much more distinct, and their projected positions can be estimated automatically by the method presented in Section 5.2. To recover point correspondences in the first frame efficiently, two approaches are used for different conditions.

The first approach employs 3D range-scanned data. Figure 5.5 shows the operation of a 3D laser scanner, and Figure 5.6 shows the process of recovering point correspondences. First, the markers’ projected positions are extracted (as shown in Figure 5.6(a)). Then, a user manually selects n (n>3) corresponding point pairs on the nose tip, eye corners, mouth corners, etc. in the first video clip to form a 3D point set. After the corresponding feature points in the 3D scanned data are also designated, the affine transformation between the 3D scanned data and the specified markers’ 3D structure can be evaluated by the least-squares solution proposed by K.S. Arun et al. [ARUN87]. We then extend the vector $\overrightarrow{o p_i}$, where $o$ is the lens center and $p_i$ is the extracted projected position of marker $i$ in the frontal view; the intersection of the line $o p_i$ with the 3D scanned data is regarded as the conjectured 3D position of marker $i$, denoted as $m_i$. The corresponding point in a side view is then recovered by mirroring $m_i$ and projecting the mirrored point back onto the image plane. Because of measurement noise, the nearest point of the same color within a tolerance region is regarded as the corresponding point $p'_i$.
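The mirroring and back-projection step can be sketched in C as follows. The plane parameterization (a unit normal `n` and an offset `d`, with the plane defined by n·p = d), the type names, and the function names are assumptions for illustration; the projection assumes normalized camera coordinates with the lens center at the origin, as used elsewhere in this chapter.

```c
/* Hypothetical types for illustration. */
typedef struct { double x, y, z; } Point3;
typedef struct { double u, v; } Point2;
/* Mirror plane: all points p with dot(n, p) = d; n must have unit length. */
typedef struct { Point3 n; double d; } MirrorPlane;

/* Reflects p across the mirror plane: p' = p - 2 (n.p - d) n. */
Point3 mirror_point(MirrorPlane pl, Point3 p)
{
    double s = 2.0 * (pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z - pl.d);
    Point3 q = { p.x - s * pl.n.x, p.y - s * pl.n.y, p.z - s * pl.n.z };
    return q;
}

/* Pinhole projection onto the normalized image plane z = 1.
 * The caller must ensure p.z is nonzero. */
Point2 project_normalized(Point3 p)
{
    Point2 q = { p.x / p.z, p.y / p.z };
    return q;
}
```

The predicted side-view position of marker $i$ would then be `project_normalized(mirror_point(plane, m_i))`, after which the nearest extracted point of the same color within the tolerance region is taken as the correspondence.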

The second approach recovers point correspondences by evaluating the subject’s 3D face structure directly from rigid-body motion. If an object is rigid, i.e., not deformable, the affine transformation (rotation R and translation t) resulting from its motion is equivalent to the inverse of the affine transformation resulting from a change of the coordinate system. Therefore, reconstructing 3D structure from rigid-body motion is equivalent to reconstructing 3D structure from multiple views [ZHAN92, WENG93, HUAN94]. With this property, we ask the subject to keep his or her face in a neutral expression and slowly move his or her head in four directions: right-up, right-down, left-down, and left-up. A preliminary 3D structure of the face can be estimated from the markers’ projected motion in the frontal view, and the point correspondences can then be recovered.

Construction of 3D candidates by mirrored epipolar lines

If $N_f$ and $N_s$ feature points of a certain color class are extracted in the frontal and side views respectively, each corresponding point pair can generate a 3D candidate, so there are $N_f N_s$ 3D candidates of that color class in total. B. Guenter et al. [GUEN98] took all these $N_f N_s$ potential 3D candidates to track the motion of $N_{mrk}$ markers. However, in a two-view system, given a point $p_i$ in the first image, its corresponding point is constrained to lie on a line called the “epipolar line” of $p_i$ [ZHAN95, HARA93]. With this constraint, one only has to search for features along the epipolar line. The number of 3D candidates

Source: Reliable Extraction of Realistic 3D Facial Animation Parameters from Mirror-reflected Multi-view Video Clips, Ph.D. Dissertation.