
5. FULLY AUTOMATIC MASS 3D MARKER TRACKING UNDER

5.3 A FULLY-AUTOMATIC TRACKING PROCEDURE FOR MASS 3D MARKERS

Fully automatic tracking of multiple target trajectories over time is an important problem in radar surveillance systems, where it is called the “multitarget tracking problem” [BUCK00]. With only measurement error and false detection, this problem is equivalent to the minimum cost network flow (MCNF) problem. The optimal solution can be computed, and the computational complexity is O(N^3 log(NC)), where N is the number of nodes in the network and C is the maximum value of the coefficients among edges [CAST90, WOLF89]. Nevertheless, when measurement errors, missing detections and false alarms all occur in tracking, time-consuming methods such as dynamic programming are required to estimate approximate trajectories, and the tracking results can degrade seriously as the number of missing detections increases even slightly [BUCK00]. As mentioned in the previous section, even though fluorescent markers and blacklight lamps are used to enhance the distinctness of markers and to stabilize the markers’ projected colors, missing and false detections are still unavoidable in the feature extraction process.

Fortunately, markers’ motion on a facial surface is unlike that of targets tracked in radar systems. Targets in the general multitarget tracking problem move independently, and consequently only the prior trajectory of a target can be used to infer the target’s movement from the detected candidates over time. By contrast, points on a facial surface have not only temporal history but also spatial coherence within the current time stamp. Except for the mouth, nostrils and eyelids, most parts of a face are continuous surfaces, and the position and motion of a facial point are similar to those of its neighbors. With this additional property, automatic diagnosis of missing and false detections becomes feasible and the computation is more efficient.

Figure 5.4 is the flow chart of the proposed tracking procedure for the UV light condition. Here, we first present issues encountered in 3D motion tracking of mass UV-responsive markers and our proposed solutions.

Equipment setting

In the first step, equipment setting, two mirrors and two UV-blacklight blue lamps are placed in front of a video camera as shown in Figure 5.1. As mentioned in previous chapters, the camera’s intrinsic parameters are first estimated by camera calibration methods [HEIK97, ZHAN00], and we adopt a well-organized camera calibration library developed by J.-Y. Bouguet [BOUG]. All operations in the following steps are performed in the normalized camera coordinates based on the evaluated camera parameters. The orientations and locations of the mirror planes are then estimated by the proposed method introduced in Chapter 3. All the equipment should be fixed stably to avoid re-calibration of device parameters.

Recovering point correspondence in the neutral face

The tracking procedure is initialized by reconstructing the 3D positions of the markers in the first frame. In Chapter 4, since the markers in video clips are not distinct enough, user interaction is required to explicitly specify all markers’ projected positions and point correspondence. By comparison, UV-responsive markers are much more distinct, and the markers’ projected positions can be automatically estimated by the method presented in Section 5.2. To recover point correspondence in the first frame efficiently, two approaches are used for different conditions.

The first approach is to employ 3D range-scanned data. Figure 5.5 shows the operation of a 3D laser scanner and Figure 5.6 shows the process of recovering point correspondences. First, the markers’ projected positions are extracted (as shown in Figure 5.6(a)). Then, a user manually selects n (n>3) corresponding point pairs on the nose tip, eye corners, mouth corners, etc. in the first video clip to form a 3D point set. After the corresponding feature points in the 3D scanned data are also designated, the affine transformation between the 3D scanned data and the specified markers’ 3D structure can be evaluated by the least-squares solution proposed by K. S. Arun et al. [ARUN87]. When we extend the vector $\overrightarrow{op_i}$, where o is the lens center and $p_i$ is the extracted projected position of marker i in the frontal view, the intersection of the line $op_i$ and the 3D scanned data is regarded as the conjectured 3D position of marker i, denoted as $m_i$. The corresponding point in a side view is then recovered by mirroring $m_i$ and projecting the mirrored point back to the image plane. Because of measurement noise, the nearest point of the same color within a tolerance region is regarded as the corresponding point $p'_i$.
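The least-squares alignment step can be illustrated with a minimal sketch of the SVD-based solution of Arun et al., assuming NumPy is available. The function name `rigid_align` and the sample points are hypothetical and are not taken from the thesis; the sketch estimates a rotation and a translation only.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares R, t such that dst ~ R @ src + t (SVD method of Arun et al.)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance matrix of the centered point sets
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:
        # Reflection case: flip the axis of the smallest singular value
        Vt[2, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

# Hypothetical data: five designated points on the scan and the same points
# after a known rotation about z and a translation.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
scan_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0], [1.0, 1.0, 0.5]])
marker_pts = scan_pts @ R_true.T + t_true
R_est, t_est = rigid_align(scan_pts, marker_pts)
```

With noise-free input the estimated pair reproduces the transformation that generated the data; with real measurements it is the best fit in the least-squares sense.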

The second approach is to recover point correspondences by evaluating a subject’s 3D face structure directly from rigid-body motion. If an object is rigid (not deformable), the affine transformation (rotation R and translation t) resulting from motion is equivalent to the inverse affine transformation resulting from changes in the coordinate system. Therefore, reconstructing 3D structure from rigid-body motion is equivalent to reconstructing 3D structure from multiple views [ZHAN92, WENG93, HUAN94]. With this property, we ask a subject to keep his or her face in a neutral expression and slowly move his or her head in four directions: right-up, right-down, left-down, and left-up. A preliminary 3D structure of the face can be estimated from the markers’ projected motion in the frontal view, and point correspondence can then be recovered.

Construction of 3D candidates by mirrored epipolar lines

If Nf and Ns feature points of a certain color class are extracted in the frontal and side views respectively, each pair of corresponding points can generate a 3D candidate, and therefore there are NfNs 3D candidates of the color class in total. B. Guenter et al. [GUEN98] used all these NfNs potential 3D candidates to track the motion of Nmrk markers. However, in a two-view system, given a point pi in the first image, its corresponding point is constrained to lie on a line called the “epipolar line” of pi [ZHAN95, HARA93]. With this constraint, one only has to search for features along the epipolar line. The number of 3D candidates decreases substantially and the computation is much more efficient.

There is a similar constraint in mirror-reflected multi-view images. Since a mirrored view can be regarded as a flipped view from a virtual camera, the constraint still holds but is flipped. We call this mirrored constraint the “mirrored epipolar line”, and briefly introduce the concept with Figure 5.7. We assume that p is an extracted feature point, o is the optic center, and $p'$, the corresponding point in the mirrored view, is unknown. Since p is a projection, the actual marker’s 3D position, m, must lie on the line $l_{op}$. According to the mirror symmetry property, the virtual marker’s 3D position, $m'$, must lie on $l'_{op}$, which is the line symmetric to $l_{op}$ with respect to the mirror plane. When a finite-size mirror model is adopted, the projection of $l'_{op}$ is a line segment, denoted as $p'_a p'_b$. The corresponding point $p'$ must then lie on the mirrored epipolar line segment $p'_a p'_b$; otherwise the marker m is not visible in the mirrored view.
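The geometric construction above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the thesis implementation: the mirror is modeled as the plane normal . x + d = 0 with a unit normal, points are projected in normalized camera coordinates, and the finite mirror extent is approximated by a depth range [z_near, z_far] along the ray. All function and parameter names are hypothetical.

```python
def reflect(point, normal, d):
    """Mirror a 3D point across the plane normal . x + d = 0 (unit normal assumed)."""
    dist = sum(n * c for n, c in zip(normal, point)) + d
    return tuple(c - 2.0 * dist * n for c, n in zip(point, normal))

def project(point):
    """Perspective projection in normalized camera coordinates."""
    x, y, z = point
    return (x / z, y / z)

def mirrored_epipolar_segment(p, normal, d, z_near, z_far):
    """Endpoints of the mirrored epipolar segment for a frontal-view feature p = (x, y)."""
    ends = []
    for z in (z_near, z_far):
        m = (p[0] * z, p[1] * z, z)            # a point on the ray from o through p
        ends.append(project(reflect(m, normal, d)))
    return ends[0], ends[1]

# Hypothetical mirror: the plane x = 1, i.e. normal (1, 0, 0) and d = -1.
p_a, p_b = mirrored_epipolar_segment((0.1, 0.2), (1.0, 0.0, 0.0), -1.0, 2.0, 4.0)
print(p_a, p_b)
```

Any marker on the ray with a depth between z_near and z_far is mirrored and projected onto the segment between the two returned endpoints.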

The mirrored epipolar line of a point p can easily be evaluated. From Equation 3.5, where (p′ Up)t =0, we expand p and p′ by their x, y, and z components. For a given p, the resulting equation is linear in the image coordinates of p′, and it is the mirrored epipolar line of p.

Due to the perturbation resulting from noise, the corresponding point may not lie exactly on the mirrored epipolar line. To evaluate potential 3D candidates with noise tolerance, we extend the line k pixels up and down (k=1.5 in our case) to form a “mirrored epipolar band” and search for corresponding points within the region between the two constraint lines. Figure 5.8 shows an example of potential point corresponding pairs generated by the mirrored epipolar constraint; Figure 5.9 shows the 3D candidates generated from the constrained point correspondences.
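As a rough illustration of the band search, the following sketch keeps only same-color side-view features that lie within k pixels above or below a line a*x + b*y + c = 0. The line representation, the `line_of` callback and the sample features are assumptions made for the example; the line is assumed not to be vertical (b != 0).

```python
def in_band(point, line, k):
    """True if `point` is within k pixels above/below the line a*x + b*y + c = 0."""
    a, b, c = line
    x, y = point
    # Vertical offset from the line is |a*x + b*y + c| / |b|
    return abs(a * x + b * y + c) <= k * abs(b)

def candidate_pairs(frontal, side, line_of, k=1.5):
    """Pair each frontal feature with same-color side features inside its band."""
    pairs = []
    for i, (p, color_p) in enumerate(frontal):
        line = line_of(p)
        for j, (q, color_q) in enumerate(side):
            if color_p == color_q and in_band(q, line, k):
                pairs.append((i, j))
    return pairs

# Hypothetical setup in which every mirrored epipolar line is horizontal: y = p_y.
def horizontal_line(p):
    return (0.0, 1.0, -p[1])

frontal = [((100.0, 50.0), "red"), ((120.0, 80.0), "blue")]
side = [((300.0, 50.9), "red"), ((310.0, 53.0), "red"),
        ((305.0, 79.0), "blue"), ((320.0, 80.5), "red")]
pairs = candidate_pairs(frontal, side, horizontal_line)
print(pairs)
```

In this toy case an exhaustive same-color pairing would produce four candidates (three red, one blue), while the band keeps two.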

Head movement estimation and removal

In Chapter 4, markers’ 3D motion trajectories are first reconstructed by a block-matching based search method with adaptive Kalman predictors and filters. For estimation of head movement, specific markers’ trajectories are used, and facial motion parameters are then evaluated by removing head movement from the markers’ motion trajectories.

As we have mentioned, markers’ 3D motion trajectories comprise both facial motion and head motion. Because the moving range of a head is larger than that of facial muscles, when a subject makes facial expressions and moves his or her head concurrently, most of the markers’ motion results from head motion. This situation could make the Kalman predictors and filters dominated by head motion and only slightly influenced by facial motion. Our experiments agree with this intuition: we find that prediction and filtering are more accurate if separate Kalman predictors/filters are applied to head motion and facial motion tracking respectively. In addition, our proposed method for finding frame-to-frame 3D point correspondences is more reliable if head motion is removed in advance. Therefore, we change our strategy here and estimate and remove head motion before finding frame-to-frame 3D point correspondence.

We define the head pose in the first frame (t=1) as upright, and the head motion at time t as the affine transformation of the head pose at time t with respect to the head pose at t=1. As we mentioned in Section 4.4, the affine transformation consists of rotation Rhead(t) and translation Thead(t). For automatic head movement tracking, seven specific markers are pasted on locations invariant to facial motion, such as a subject’s ears and the concave tip on the nose column. Adaptive Kalman filters are again applied to alleviate jitter resulting from measurement errors. Unlike Chapter 4, where a position-velocity state model is used for each marker, a different formulation is taken here.

Rhead(t) and Thead(t) each have three degrees of freedom: Thead(t) = [tx(t), ty(t), tz(t)], and Rhead(t) is determined by the three rotation angles rx(t), ry(t), and rz(t) through Equation 5.4, in which c and s are abbreviations of cos( ) and sin( ). We apply Kalman filters to these six parameters [rx(t), ry(t), rz(t), tx(t), ty(t), tz(t)] directly. The process of head motion evaluation is as follows:

Step 1. Users designate the specific markers si (for i=1...Nsmrk) for head motion tracking from the reconstructed 3D markers of the neutral face (t=1) and denote their positions as msi(1).

(This step can be further extended to be fully automatic by clustering methods, such as the K-means algorithm (k=3), since markers on an ear are close to each other and far from those on the face.)

Step 2. Initialize adaptive Kalman filters for head motion and set rx(0) = ry(0) = rz(0) = 0, tx(0) = ty(0) = tz(0) = 0, and t = 1.

Step 3. Predict the head motion parameters rx(t+1|t), ry(t+1|t), rz(t+1|t), tx(t+1|t), ty(t+1|t), and tz(t+1|t) by Kalman predictors and then construct Rhead(t+1|t) and Thead(t+1|t) by Equation 5.4.

Increase the time stamp: t = t+1.

Step 4. Generate the predicted positions of the specific markers as

$$m_{s_i}(t|t-1) = R_{head}(t|t-1)\, m_{s_i}(1) + T_{head}(t|t-1) \qquad (5.5)$$

and find msi(t) by searching for the nearest potential 3D candidate of the same color. The search is restricted to a spherical range centered at $m_{s_i}(t|t-1)$ with radius $r_{srch}$.

If no candidate is found, set the marker ineffective at time t.

(The potential 3D candidates are constructed by the method in the above subsection.)

Step 5. Detect false tracking, that is, specific markers whose estimated motions are inconsistent with those of the other specific markers; set these markers ineffective at time t.

(The false tracking detection is presented in the next subsection. We skip the details here.)

Step 6. Estimate the affine transformation (Rmsr and Tmsr) of the effective specific markers between time t and the 1st frame by the method proposed by K. S. Arun et al. [ARUN87].

Extract rmsr_x(t), rmsr_y(t) and rmsr_z(t) from Rmsr by Equation 5.4, and extract tmsr_x(t), tmsr_y(t) and tmsr_z(t) from Tmsr.

Step 7. Take [rmsr_x(t), rmsr_y(t), rmsr_z(t)] and [tmsr_x(t), tmsr_y(t), tmsr_z(t)] as measurement inputs to the adaptive Kalman filters and estimate the outputs [rx(t), ry(t), rz(t)] and [tx(t), ty(t), tz(t)].

Step 8. If t > Tlimit, stop; else goto Step 3.
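Step 4 of the procedure above can be sketched as follows. This is a minimal illustration, assuming 3x3 rotation matrices stored as nested lists and candidates stored as (position, color) pairs; the function names and sample values are hypothetical.

```python
import math

def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(R[r][c] * v[c] for c in range(3)) for r in range(3))

def predict_marker(R_pred, T_pred, m_neutral):
    """Predicted position: R_head(t|t-1) * m_si(1) + T_head(t|t-1)."""
    rotated = mat_vec(R_pred, m_neutral)
    return tuple(rotated[i] + T_pred[i] for i in range(3))

def nearest_candidate(predicted, candidates, color, r_srch):
    """Nearest same-color candidate within radius r_srch, or None if there is none."""
    best, best_dist = None, r_srch
    for pos, cand_color in candidates:
        if cand_color != color:
            continue
        dist = math.dist(predicted, pos)
        if dist <= best_dist:
            best, best_dist = pos, dist
    return best

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
predicted = predict_marker(identity, (0.0, 0.0, 1.0), (1.0, 2.0, 3.0))
candidates = [((1.1, 2.0, 4.0), "green"), ((1.0, 2.0, 4.05), "red"),
              ((5.0, 5.0, 5.0), "green")]
found = nearest_candidate(predicted, candidates, "green", 0.5)
missing = nearest_candidate(predicted, candidates, "blue", 0.5)
```

A return value of None corresponds to setting the marker ineffective at time t.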

The operation of the Kalman filter for translation [tx(t), ty(t), tz(t)] is the same as that of the Kalman filter for each marker in Chapter 4, which takes 3D positions as input and whose internal states are positions and velocities. The operation of the Kalman filter for rotation [rx(t), ry(t), rz(t)] is similar, but the input is a set of angles and the internal states represent angles and angular velocities. With this improved procedure, head motion tracking is more reliable and stable than with the process in Chapter 4, where specific markers are tracked independently.
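To make the angle and angular-velocity state model concrete, here is a minimal sketch of a standard constant-velocity Kalman filter for a single rotation angle. It is not the adaptive filter of Chapter 4: the noise parameters q and r are fixed, hypothetical values, and the class name is invented for the example.

```python
class AngleKalman:
    """Constant-velocity Kalman filter with state [angle, angular velocity]."""

    def __init__(self, q=1e-4, r=1e-2):
        self.angle, self.rate = 0.0, 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q, self.r = q, r               # process / measurement noise (assumed)

    def predict(self, dt=1.0):
        # x <- F x with F = [[1, dt], [0, 1]]
        self.angle += dt * self.rate
        (p00, p01), (p10, p11) = self.P
        # P <- F P F^T + q I
        self.P = [[p00 + dt * (p10 + p01) + dt * dt * p11 + self.q, p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]
        return self.angle

    def update(self, measured_angle):
        (p00, p01), (p10, p11) = self.P
        s = p00 + self.r                    # innovation variance, H = [1, 0]
        k0, k1 = p00 / s, p10 / s           # Kalman gain
        residual = measured_angle - self.angle
        self.angle += k0 * residual
        self.rate += k1 * residual
        # P <- (I - K H) P
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.angle

kf = AngleKalman()
for step in range(1, 51):
    kf.predict()
    kf.update(0.01 * step)   # noise-free ramp of 0.01 rad per frame
```

One such filter per parameter would cover the six head motion parameters; the translation filters have the same form with positions and velocities as states.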

Once the head motion at time t is evaluated, an inverse affine transformation similar to Equation 4.8 is applied to potential 3D candidates for head motion removal.
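A minimal sketch of head motion removal follows. Equation 4.8 is not reproduced in this section, so the form used here, m_face = R^T (m - T), is an assumption that relies only on the inverse of a rotation matrix being its transpose; the names are hypothetical.

```python
def remove_head_motion(points, R_head, T_head):
    """Apply the inverse rigid transform R^T (m - T) to each 3D candidate."""
    out = []
    for m in points:
        diff = [m[i] - T_head[i] for i in range(3)]
        # Multiplying by R^T: output component r uses column r of R
        out.append(tuple(sum(R_head[c][r] * diff[c] for c in range(3))
                         for r in range(3)))
    return out

# Hypothetical head motion: 90-degree rotation about z, then a shift along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T = (1.0, 0.0, 0.0)
moved = [(1.0, 1.0, 0.0)]          # the neutral point (1, 0, 0) after R and T
restored = remove_head_motion(moved, R, T)
```

After this step the candidates are expressed in the head pose of the first frame, so the remaining displacement of a marker is facial motion only.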

Finding frame-to-frame 3D point correspondence with outlier detection

In the previous subsections, we have presented how to reconstruct markers’ 3D structure in the first frame, how to generate 3D candidates for tracking, and how to remove head motion from motion trajectories. In this subsection, we assume that head motion has been removed from the potential 3D candidates, and our goal is to track markers’ motion trajectories from a sequence of potential 3D candidates frame by frame.

Figure 5.10 is a conceptual diagram of the problem statement. The number of potential 3D candidates in a frame is around 1.2 to 2.3 times the number of actual markers. The additional 3D candidates can be regarded as false detections in the multitarget tracking problem. If only false detection occurs, the graph algorithms for minimum cost network flow (MCNF) can evaluate the optimal solution. In our case, we employ Kalman predictors and filters to efficiently capture the time-varying position of each marker. However, a marker can occasionally be “missing” in video clips. The missing condition results from blocking or occlusion due to view directions, incorrect classification of markers’ colors, or noise disturbance. When missing and false detections occur concurrently, a simple tracking method without evaluation of false tracking would degenerate and the subsequent motion trajectories could be disordered.

We use an example to explain the serious consequences of false tracking. In Figure 5.11, the marker B is not included in the potential 3D candidates of the third frame, and its actual position is denoted as B(3). Based on the previous trajectory, B’(3) is the potential candidate nearest to the predicted position. According to this false trajectory B(1)→B(2)→B’(3), the next position should be B’(4). Consequently, the motion trajectory starts to “derail” seriously and is difficult to recover. Furthermore, false tracking of a marker may even interfere with the tracking of other markers. In the example of Figure 5.11, the marker C is also undetected in the fourth frame; the candidate nearest to the predicted position is C’(4). Unfortunately, C’(4) is actually the marker D at the fourth frame, denoted as D(4). Because each potential candidate should be “occupied” by at most one marker, a misjudgment would make not only the marker C but also the marker D depart from the correct trajectories.

For detection of false tracking, we take advantage of the spatial coherence of face surfaces, which means a marker’s motion is similar to that of its neighbors.

Before we present our method, we define the terms. For each marker, its neighbors are the other markers located within a 3D distance ε of the marker in the neutral face. For the motion of marker i at time t, we do not use the 3D location difference between time t-1 and t but the location difference between time t and time 1 instead, denoted vi(t)=mi(t)−mi(1). This is because the former is easily disturbed by measurement noise while the latter is more reliable. The motion similarity between marker i and marker j at time t is measured by the Euclidean distance between the two motion vectors, ‖vi(t)−vj(t)‖.

A statistical approach is used to judge whether a marker’s motion is a false tracking at time t. For each marker i, we first calculate its motion similarity to each neighbor and sort the neighbors from most similar to least similar. To avoid the judgment being contaminated by the motions of neighbors that are themselves undetected false trackings, only the first α% of the neighbors are included in the sample space Ω (α = 66.67 in our experiments).

We assume that the vectors within the sample space Ω approximate a Gaussian distribution. The averages and standard deviations of the x, y, and z components of vj (for j ∈ Ω) are then evaluated and denoted as (µvx, µvy, µvz) and (σvx, σvy, σvz) respectively. A tracked motion vi(t) is valid if it is not far from the distribution of most of its neighbors.

The judgment criterion of valid or false tracking for the marker i (Equation 5.6) is based on a judgment function (Equation 5.7) whose values Sx, Sy, and Sz measure the divergence of vi(t) from the motions of its neighbors Ω along the x, y, and z directions. If the divergences are within the standard deviations, the values are smaller than one; if the divergences are larger, the values increase. In Equation 5.7, k is a small user-defined number. With k in the denominators, we can prevent unpredictable values of Sx, Sy, and Sz when markers are close to their locations in the neutral face.
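The statistical test can be sketched as below. Because the exact criterion and judgment function are not reproduced here, the sketch assumes a per-axis score S = |v − µ| / (σ + k) and a hypothetical acceptance threshold `s_max`; both are illustrative choices and not necessarily the formulas of Equations 5.6 and 5.7.

```python
import math
from statistics import mean, pstdev

def is_valid_tracking(v_i, neighbor_motions, alpha=66.67, k=0.05, s_max=3.0):
    """Judge a motion vector against the most similar alpha% of its neighbors."""
    if not neighbor_motions:
        return True                      # nothing to compare against
    # Rank neighbors from most to least similar (smallest motion difference first)
    ranked = sorted(neighbor_motions, key=lambda v_j: math.dist(v_i, v_j))
    keep = max(1, int(len(ranked) * alpha / 100.0))
    omega = ranked[:keep]                # sample space
    for axis in range(3):
        comps = [v[axis] for v in omega]
        mu, sigma = mean(comps), pstdev(comps)
        s_axis = abs(v_i[axis] - mu) / (sigma + k)   # assumed form of S
        if s_axis > s_max:
            return False
    return True

# Hypothetical neighbor motions; the last one is itself a false tracking.
neighbors = [(1.0, 0.0, 0.0), (1.1, 0.1, 0.0), (0.9, -0.1, 0.0),
             (1.0, 0.0, 0.1), (1.05, 0.0, -0.1), (5.0, 5.0, 5.0)]
good = is_valid_tracking((1.0, 0.05, 0.0), neighbors)
bad = is_valid_tracking((-2.0, 3.0, 0.0), neighbors)
```

Keeping only the most similar neighbors excludes the outlying neighbor from the statistics, and the constant k keeps the score bounded when the neighbors barely move and σ is close to zero.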

After we eliminate the false tracking of 3D candidates, a conflicting situation can still exist. Two valid motions that do not share the same 3D candidate could still share the same extracted 2D feature point in either the frontal view or the side view. We call this the tracking conflict. To prevent the tracking conflict, we simply count the number of valid motions for each 2D feature point. If a 2D feature point is “occupied” by more than one valid motion, we keep only the motion closest to the prediction as a valid motion.

In our experiment, the average occurrence rate of false tracking in a frame is 7.45%, and that of tracking conflict is 0.34%.

Conjecturing positions of missing markers

If a false tracking is detected, the motion similarity of its neighbors can also be used to conjecture the position or motion of the missing marker. Based on this idea, two interpolation methods are applied to the estimation. The first one is the weighted combination method. For a missing marker i, the motion at time t can be estimated by a weighted combination of the motions of its neighbors, and it can be represented by the equation

$$v_i = \sum_{j \in \mathrm{Neighbor}(i)} w_{ij}\, v_j, \qquad w_{ij} \propto \frac{1}{d_{ij} + k_c} \qquad (5.8)$$

where dij is the distance between mi and mj in the neutral face and kc is a small constant to avoid a very large weight when the marker i and j are quite close in the neutral face.
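The weighted combination can be sketched as follows. The weights 1/(dij + kc) follow the description above, while normalizing them so that they sum to one is an assumption of this example; the function name and sample values are hypothetical.

```python
def estimate_missing_motion(neighbors, k_c=0.1):
    """Estimate v_i from (d_ij, v_j) pairs with weights 1 / (d_ij + k_c)."""
    weights = [1.0 / (d_ij + k_c) for d_ij, _ in neighbors]
    total = sum(weights)                 # normalization (assumed)
    return tuple(
        sum(w * v_j[axis] for w, (_, v_j) in zip(weights, neighbors)) / total
        for axis in range(3)
    )

# Two neighbors: a close one moving along x and a distant one at rest.
v_est = estimate_missing_motion([(0.4, (2.0, 0.0, 0.0)), (1.9, (0.0, 0.0, 0.0))])
print(v_est)
```

Here the closer neighbor receives four times the weight of the distant one, so the estimate stays near the motion of the close neighbor.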

In addition, an RBF (radial basis function) based scattered data interpolation method is appropriate for the position estimation of missing markers. The above-mentioned weighted combination method tends to average and smooth the motions of all the neighbors; by contrast, the influence of nearby neighbors is generally greater in RBF interpolation (depending on the radial basis function), and therefore more prominent motions can be estimated. Details of the RBF interpolation are introduced in Chapter 7 for face deformation. Since the RBF interpolation is more time-consuming, the weighted combination is adopted for real-time or near real-time tracking. Figure 5.13 shows the motion estimated by the proposed method; Figure 5.14 shows the tracking results of a method with Kalman filtering only and of our method with rectification of false tracking.

The complete procedure of automatic UV-responsive marker tracking is listed at the end of this chapter.

Figure 5.4 The flow chart of our automatic 3D motion tracking procedure for a large number of UV-responsive markers. Stages shown in the chart: evaluate devices’ parameters; reconstruct markers’ 3D positions of the 1st frame; then, for t = 2 ~ Tend: feature extraction; construct potential 3D candidates; evaluate and remove head motion from markers’ movement (adaptive Kalman filters on the head motion parameters, with the predicted head motion of the t+1th frame fed back through a delay); find 3D point correspondences (t-1th to tth frame); rectify false tracking and conflict; refine the 3D facial motion trajectories (adaptive Kalman filters, with the predicted positions of the t+1th frame fed back through a delay).

Figure 5.5 A 3D laser scanner is employed to acquire depth range images of a subject. We integrate two to three range scans for the 3D face structure of a subject.


Figure 5.6 Recovering point correspondences with 3D scanned data and RBF interpolation.

Figure 5.7 A conceptual diagram of the “mirrored epipolar line”. p is an extracted feature in the frontal view and $l_{op}$ is the line extended by op. $l'_{op}$ is the line symmetric to $l_{op}$ with respect to the mirror plane. $m_a m_b$ is the projection segment of $l'_{op}$ on the mirror plane. $p'_a p'_b$, the projection of $m_a m_b$ on the image plane I, is the mirrored epipolar line segment of p.

Figure 5.8 Candidates of point corresponding pairs under mirrored epipolar constraints. For each extracted feature in the frontal view, each feature point of the same color that lies within the mirrored epipolar band is regarded as a corresponding
