DOI 10.1007/s11263-012-0559-y

Consistent Binocular Depth and Scene Flow with Chained Temporal Profiles

Chun Ho Hung · Li Xu · Jiaya Jia

Received: 5 November 2011 / Accepted: 8 August 2012

© Springer Science+Business Media, LLC 2012

Abstract We propose a depth and image scene flow estimation method taking the input of a binocular video. The key component is motion-depth temporal consistency preservation, making computation in long sequences reliable. We tackle a number of fundamental technical issues, including connection establishment between motion and depth, structure consistency preservation in multiple frames, and long-range temporal constraint employment for error correction. We address all of them in a unified depth and scene flow estimation framework. Our main contributions include development of motion trajectories, which robustly link frame correspondences in a voting manner, rejection of depth/motion outliers through temporal robust regression, novel edge occurrence map estimation, and introduction of anisotropic smoothing priors for proper regularization.

Keywords Video depth estimation · Consistent scene flow · Chained temporal profiles · Stereo matching

Electronic supplementary material The online version of this article (doi:10.1007/s11263-012-0559-y) contains supplementary material, which is available to authorized users.

C.H. Hung · L. Xu · J. Jia (corresponding author)
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
e-mail: leojia@cse.cuhk.edu.hk

C.H. Hung
e-mail: chhung@cse.cuhk.edu.hk

L. Xu
e-mail: xuli@cse.cuhk.edu.hk

1 Introduction

In many computer vision tasks, reliable depth and motion estimation fundamentally assures high-quality result production. If depth can be accurately inferred in 3D videos with the necessary temporal consistency, traditionally challenging video editing tasks that alter color, structure, and geometry, as well as high-level scene understanding and recognition, can be accomplished much more easily. In addition, with the rapid prevalence of 3D display and 3D capturing devices, the "2D-plus-depth" format becomes common and important, as it can be used to generate new views¹ for 3DTV.

Although a tremendous number of binocular videos have come into existence, with only two views it is still very difficult to compute reliable and consistent depth in long sequences. Structure-from-motion (SFM) and multi-view stereo matching can be applied to static scenes, where global constraints are established through multi-view geometry (Snavely et al. 2006; Furukawa and Ponce 2007; Zhang et al. 2009). These techniques are not suitable for videos that contain moving or deforming objects, which handicap correspondence establishment across multiple frames.

In optical flow estimation (Baker et al. 2011), which captures 2D apparent motion, correspondence between consecutive frames can be established. 2D optical flow and depth variation are jointly considered in Patras et al. (1996), Zhang and Kambhamettu (2001), Vedula et al. (2005), Huguet and Devernay (2007), Wedel et al. (2008, 2011), Valgaerts et al. (2010), and Rabe et al. (2010), which is typically referred to as image scene flow. Given intrinsic camera parameters, 3D scene flow can be constructed (Basha et al. 2010; Wedel et al. 2011). These methods either compute motion and depth independently or resort to a four-image configuration. They do not tackle the temporal-consistency preservation problem in multiple frames.

¹ 2D-Plus-Depth (2009). Stereoscopic video coding format. http://en.wikipedia.org/wiki/2D-plus-depth


Local temporal constraints were imposed in depth estimation for video sequences, known as spatiotemporal stereo matching (Zhang et al. 2003; Richardt et al. 2010). These methods do not cope with object motion across multiple frames and thus are more suitable for static-scene videos.

In optical flow estimation, the methods presented in Black (1994) and Bruhn and Weickert (2005) have temporal terms. They, however, may suffer from two main estimation problems. First, it is difficult or inefficient to propagate information temporally over long ranges. Second, erroneous estimates caused by occasional noise, sudden luminance change, and outliers in one frame could influence later results. There is no effective way to measure and reduce estimation errors globally.

We aim at reliable depth and motion estimation from multi-frame binocular videos with appropriate temporal consistency. Our method is not based on global multi-view geometry because dynamic objects do not obey it. We also do not rely on locally established temporal constraints, due to their inefficiency in information propagation.

We make several major contributions to construct the system, which measures and reduces estimation errors over multiple frames. (1) We propose motion trajectories that link reliable corresponding points among frames; they are robust against occasional noise and abrupt luminance variation. (2) We build structure profiles by considering multi-frame edges. Through a voting-like step, only edges reliable in multiple frames are enhanced. (3) Long-range temporal constraints are advocated, based on the robust motion trajectories. Regression then corrects errors and improves estimates temporally. (4) Last but not least, we propose anisotropic smoothing to non-uniformly regularize pixels, incorporating temporal edge information and preventing unconstrained boundary degradation.

2 Related Work

Simultaneous depth and motion estimation from stereo images was studied in Zhang and Faugeras (1992). In Wedel et al. (2008), motion and depth were computed sequentially and independently, assuming that the depth in previous frames is known. To improve the results, depth and motion are jointly estimated using two stereo pairs (Patras et al. 1996; Zhang and Kambhamettu 2001; Min and Sohn 2006; Huguet and Devernay 2007; Valgaerts et al. 2010). These approaches estimate motion fields from two calibrated cameras, where constraints are established in the four-frame configuration. Temporal consistency may still be a problem.

Recently, Wedel et al. (2011) pointed out that decoupling disparity and motion field computation is advantageous in that different optimization techniques can be applied. A semi-dense scene flow field was computed in Cech et al. (2011) through locally growing correspondence seeds. Instead of modeling image scene flow, Basha et al. (2010) imposed constraints on 3D surface motion, using calibrated multi-view sequences. Rabe et al. (2010) applied Kalman filtering to independently computed flow estimates and disparities. Hadfield and Bowden (2011) proposed a particle approach for scene flow estimation, making use of the depth sensor and the monocular image camera in Microsoft Kinect. Vogel et al. (2011) incorporated a local rigidity constraint to regularize scene flow estimation. Our method primarily differs from these approaches in the way temporal consistency is enforced, as multi-frame long-range structure information is made use of.

Efforts have also been put into motion/depth discontinuity handling. Zhang and Kambhamettu (2001) used segmentation and applied piecewise regularization. Xu et al. (2008) applied a segmentation model to optical flow estimation. An edge-preserving regularizer (Min and Sohn 2006), an image-driven regularizer (Sun et al. 2008), and a complementary regularizer (Zimmer et al. 2009) were used to preserve motion and depth boundaries. The color edges or segments that are used as guidance are generally hard to keep consistent over time, making it difficult to produce high-quality depth boundaries.

In optical flow, the two-frame configuration is common (Brox et al. 2004; Bruhn et al. 2005; Zimmer et al. 2009; Xu et al. 2010). A taxonomy of optical flow methods, along with comparisons, is reported on the Middlebury website (Baker et al. 2011). To enforce temporal smoothness, in Brox et al. (2004), Bruhn and Weickert (2005), and Bruhn et al. (2005), constraints are derived by assuming that the flow vectors from consecutive frames at the same image location are similar. This also applies to smoothly-varying motion. Álvarez et al. (2007) enforced symmetry between the forward and backward flow to reject outliers. Assuming constant motion acceleration, Black (1994) enforced temporal smoothness by predicting motion for the next frame using the current estimate. These methods only consider consecutive frames, which is not effective and may accumulate errors when propagating information among frames that are far apart.

Using multiple frames, Irani (2002) projected flow vectors onto a subspace and assumed that the resulting matrix has a low rank for noise removal. Global motion in static scenes is considered. To construct chained motion trajectories, particle samples were generated and linked in a probabilistic way (Sand and Teller 2008). In Sundaram et al. (2010), chained trajectories were constructed by a bi-directional check of motion vectors based on large-displacement optical flow estimation (Brox et al. 2009). The quality of trajectories depends excessively on flow estimates from consecutive frames, making these methods possibly vulnerable to noise and estimation outliers.


Fig. 1 Correspondences of pixel $x$ in different frames

To achieve temporal consistency in depth estimation, Zhang et al. (2003) proposed spacetime data costs that aggregate the data term over a short period. Richardt et al. (2010) extended the idea by applying a spatio-temporal cross-bilateral grid to the data cost, derived from the locally adaptive support aggregation window of Yoon and Kweon (2006). The temporal relationship is also only considered over nearby video frames. These methods do not deal with large motion between frames and thus are more suitable for static scenes.

Contrary to all these approaches, we propose a general binocular framework addressing the temporal consistency problem in depth and motion estimation in long sequences. Temporal information from multiple frames is incorporated with novel chained profiles.

3 Notations and Problem Introduction

Given a rectified binocular video, our method computes two-view stereo for each frame pair across the two sequences, together with dense motion in each sequence. Our framework can be readily extended to unrectified stereo videos linked with a fundamental matrix (Valgaerts et al. 2010).

We denote corresponding frames in the stereo sequences as $f_l^t$ and $f_r^t$, indexed by time $t$, where $t \in \{0, 1, \ldots, N-1\}$. For each pixel $x$ in frame $f_l^t$, we find the corresponding pixels in the neighboring frames either temporally using motion estimation or spatially with stereo matching, as shown in Fig. 1. The correspondence $x_r^t$ in $f_r^t$ is expressed as

$$x_r^t = x + d_l^t(x),$$

where $d_l^t$ is the view-dependent disparity. In this paper, we use $d_l^t(x)$ and $d_l^t$ interchangeably to represent the disparity value at point $x$. Meanwhile, the optical flow correspondence in the left sequence for pixel $x$ is

$$x_l^{t+1} = x + u_l^{t,t+1},$$

based on the displacement $u$ toward frame $f_l^{t+1}$. The 2D vector $u_l^{t,t+1}$ is written as $u_l^{t,t+1} = (u_l^{t,t+1}, v_l^{t,t+1})^T$, as shown in Fig. 1. Finally, the correspondence for $x$ in $f_r^{t+1}$ is expressed as $x_r^{t+1} = x + u_l^{t,t+1} + d_l^{t+1}$, involving both motion and stereo. The 3D image scene flow is, by convention, denoted as

$$s^{t,t+1} = \left(u_l^{t,t+1},\ v_l^{t,t+1},\ \delta d_l^{t,t+1}\right)^T,$$

where $\delta d_l^{t,t+1} = d_l^{t+1}(x + u_l^{t,t+1}) - d_l^t$. This representation includes the spatial shift and depth variation for correspondences in two successive frames. In this paper, both "depth" and "disparity" are used to denote the displacement of pixels between the two views, although, strictly speaking, depth is proportional to the reciprocal of disparity.

The above correspondences suggest a few fundamental constraints that are used in spatial-temporal depth estimation for every set of four neighboring frames (illustrated in Fig. 1). The four commonly used scene flow conditions are

$$E_F = \sum_k \Gamma\!\left(\left(f_r^{[k]t}\!\left(x + d_l^t\right) - f_l^{[k]t}(x)\right)^2\right),$$

$$E_L = \sum_k \Gamma\!\left(\left(f_l^{[k]t+1}\!\left(x + u_l^{t,t+1}\right) - f_l^{[k]t}(x)\right)^2\right),$$

$$E_B = \sum_k \Gamma\!\left(\left(f_r^{[k]t+1}\!\left(x + u_l^{t,t+1} + d_l^t + \delta d_l^{t,t+1}\right) - f_l^{[k]t+1}\!\left(x + u_l^{t,t+1}\right)\right)^2\right),$$

$$E_R = \sum_k \Gamma\!\left(\left(f_r^{[k]t+1}\!\left(x + u_l^{t,t+1} + d_l^t + \delta d_l^{t,t+1}\right) - f_r^{[k]t}\!\left(x + d_l^t\right)\right)^2\right), \qquad (1)$$

where $[k]$ indexes channels and $\Gamma(\cdot)$ is the robust Charbonnier function, a variant of the L1 regularizer written as $\Gamma(y^2) = \sqrt{y^2 + \epsilon^2}$, used to reject matching outliers. $E_F$ and $E_B$ are stereo constraints and $E_L$ and $E_R$ are motion constraints for the neighboring four-frame set. In what follows, we omit the subscript $l$ for all left-view unknowns for simplicity's sake.

In Eq. (1), each $f$ has 5 channels as adopted in Sand and Teller (2006), i.e., $f = \left(f_I,\ \tfrac{1}{4}(f_G - f_R),\ \tfrac{1}{4}(f_G - f_B),\ f_{\partial h},\ f_{\partial v}\right)$, to make the following computation slightly more robust against illumination variation compared to only considering RGB colors. $f_I$ is the image intensity; $f_{\partial h}$ and $f_{\partial v}$ are horizontal and vertical intensity gradients.
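To make the channel construction and the robust penalty concrete, here is a minimal numpy sketch. The exact intensity conversion, the gradient operator, and the value of $\epsilon$ are not specified in this excerpt, so they are assumptions rather than the authors' implementation.

```python
import numpy as np

def charbonnier(y_sq, eps=1e-3):
    """Robust Charbonnier penalty Gamma(y^2) = sqrt(y^2 + eps^2).
    The value of eps is an assumption."""
    return np.sqrt(y_sq + eps ** 2)

def five_channel(img):
    """5-channel representation (f_I, (f_G - f_R)/4, (f_G - f_B)/4, f_dh, f_dv)
    for an H x W x 3 float RGB image in [0, 1].  Intensity conversion and
    gradient operator are assumptions."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    f_i = (r + g + b) / 3.0              # image intensity
    f_gr = 0.25 * (g - r)                # first chroma difference
    f_gb = 0.25 * (g - b)                # second chroma difference
    f_dv, f_dh = np.gradient(f_i)        # vertical and horizontal gradients
    return np.stack([f_i, f_gr, f_gb, f_dh, f_dv], axis=-1)
```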

Key Issues Simultaneously estimating all unknowns, i.e., depth and scene flow, is computationally expensive. It takes hours for the variational method (Huguet and Devernay 2007) to compute one scene flow field for one frame. We also found that only using these constraints cannot produce temporally consistent results. As briefed in the introduction, only connecting very close frames lacks the representation ability to describe the relationship among correspondences that are far apart in the sequence. This deficiency could cause devastating failure in estimation.

Fig. 2 Depth/scene flow estimation example. (a) and (e) are two corresponding frames in a binocular video. (b)–(d) show the depth estimates from the joint method of Huguet and Devernay (2007). (f) and (g) are our depth results. (h) visualizes the 3D scene flow

We show one example in Fig. 2, where (a) and (e) are two stereo frames in the binocular sequence. (b)–(d) contain depth maps estimated using the joint method of Huguet and Devernay (2007). Even by simultaneously considering the motion and stereo terms, along with the L1 regularization, the results are not temporally very consistent.

This is explainable: when all correspondences are locally established between successive frames, they are vulnerable to noise, occlusion, and illumination variation. Unlike multi-view stereo, there is no global constraint to find and eliminate errors over the sequence. Even with the temporal constraints, when one depth estimate is problematic, all following computation steps can be affected, inevitably accumulating errors and eventually failing estimation. In light of this, other considerations should be taken, especially for long sequences.

We propose new chained temporal constraints to make long-range depth-flow estimation less problematic. We show in Fig. 2(f)–(g) our depth results and in (h) our 3D scene flow estimate. Their quality is very high. In what follows, we use gray-scale values to visualize disparity maps and 3D arrows to visualize image scene flow. 2D optical flow is color-coded according to the wheel in Fig. 2(h), in which hue represents motion direction and intensity indicates flow magnitude. 3D scene flow vectors are similarly color-coded in their first two dimensions.

Algorithm 1 Outline of Our Method
INPUT: a rectified stereo sequence
1. Initialize motion fields and disparity maps. (Sect. 4.1)
2. Establish temporal constraints. (Sect. 4.2)
   2.1 Build robust trajectories.
   2.2 Compute edge occurrence maps.
   2.3 Compute trajectory-based depth and motion using robust regression.
3. Refine depth and scene flow with global temporal constraints. (Sect. 4.3)
OUTPUT: disparity maps and scene flow

4 Our Approach

Our method generates consistent depth and scene flow from only a binocular video, utilizing pixel correspondences among multiple frames. An overview of our system is given in Algorithm 1, which consists of the main steps of initialization, temporal constraint establishment, and final depth and joint scene flow refinement.

4.1 Initialization

To initialize depth and motion, by convention, we apply a variational method to optical flow estimation and use discrete optimization for two-view stereo matching, the latter of which is capable of estimating large disparities.


Fig. 3 Initial depth and flow. (a)–(b) Depth maps for two consecutive frames. (c) Close-ups of (a) and (b) in a top-down order. (d)–(e) Initial color-coded optical flow fields

Disparity Initialization We compute the disparity $d_l^t$ for each frame $t$ by optimizing

$$E_0(d^t) = \int_\Omega \left( E_F(d^t) + \beta_d E_{S_d}(\nabla d^t) \right) dx, \qquad (2)$$

where $\Omega$ is the domain of $x$ on the 2D image grid. $E_F(d^t)$ is given in Eq. (1). $E_{S_d}(\nabla d^t)$ is the truncated L1 function to preserve discontinuity, written as $E_{S_d}(\nabla d^t) := \min(|\nabla d^t|, \rho)$, where $\rho = 3$; it is a regularization term to preserve edges. $\beta_d$ is a weight. With the discrete energy function, we solve Eq. (2) using graph cuts (Kolmogorov and Zabih 2004). Occlusion is further explicitly labeled using the uniqueness check (Scharstein and Szeliski 2002): when two pixels in the left view are mapped to the same pixel in the right view, the one with smaller disparity is marked as occluded, with $o_d(x) = 0$.
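The uniqueness check can be sketched as follows. The disparity sign convention ($x_r = x + d$) follows the notation in Sect. 3; the per-row bookkeeping is an assumed implementation, not the authors' code.

```python
import numpy as np

def occlusion_map(disp_left):
    """Uniqueness check: left-view pixels competing for the same right-view
    pixel keep only the largest-disparity match; the others get o_d = 0.
    disp_left is an H x W disparity map with x_r = x + d."""
    h, w = disp_left.shape
    o_d = np.ones((h, w), dtype=np.uint8)
    for y in range(h):
        best = {}                                # right column -> (disparity, left column)
        for x in range(w):
            xr = int(round(x + disp_left[y, x]))
            if not 0 <= xr < w:
                continue
            if xr in best and best[xr][0] >= disp_left[y, x]:
                o_d[y, x] = 0                    # current pixel loses, mark occluded
            else:
                if xr in best:
                    o_d[y, best[xr][1]] = 0      # previous winner is now occluded
                best[xr] = (disp_left[y, x], x)
    return o_d
```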

Disparity maps for two consecutive frames are shown in Fig. 3(a)–(b). Textureless regions, such as the mountain, and region boundaries have inconsistent estimates. Close-ups of the depth boundaries are shown in (c). The inconsistent depth values cause flickering. See our supplementary video for the depth sequence.²

Optical Flow Initialization We initialize 2D motion in a variational framework. Both the forward and backward flow vectors are computed, denoted as $u_l^{t,t+1}$ and $u_l^{t+1,t}$, for outlier rejection. For the following robust estimation, which is detailed in Sect. 4.2, we also compute bi-directional flow between frames $f_l^t$ and $f_l^{t+2}$ (denoted as $u_l^{t,t+2}$ and $u_l^{t+2,t}$), and between frames $f_l^t$ and $f_l^{t+3}$ (denoted as $u_l^{t,t+3}$ and $u_l^{t+3,t}$). Figure 4 illustrates these vectors. As all motion vectors are computed similarly, we only describe estimation of $u_l^{t,t+1}$. It is achieved by minimizing

$$E_0\!\left(u_l^{t,t+1}\right) = \int_\Omega \left( E_L\!\left(u_l^{t,t+1}\right) + \beta_u E_{S_u}\!\left(\nabla u_l^{t,t+1}\right) \right) dx, \qquad (3)$$

where $E_{S_u}(\nabla u_l^{t,t+1})$ is the total variation regularizer, expressed as $\sqrt{\|\nabla u_l^{t,t+1}\|^2 + \epsilon^2}$, to preserve edges. It is a convex penalty function and is commonly used in the variational framework. $\beta_u$ is a weight controlling the smoothness of the computed flow fields. Equation (3) is optimized by the efficient method of Brox et al. (2004). The initially estimated optical flow for two consecutive frames is shown in Fig. 3(d)–(e).

² http://www.cse.cuhk.edu.hk/%7eleojia/projects/depth/

Fig. 4 Flow vector illustration

4.2 Chained Temporal Priors

A part of our contribution lies in constructing motion trajectories after initialization, which link corresponding pixels in several frames, and in proposing new structure profiles, essential in our system to form new temporal constraints.

Robust Motion Trajectory Estimation For each pixel, to find its correspondences in other frames, we build motion trajectories based on the optical flow estimate in each sequence. Note that motion vectors are not necessarily integer-valued and thus may link sub-pixel positions. Specifically, $x + u^{t,t+1}(x)$ in $f^{t+1}$, which is mapped from $x$ in $f^t$ based on the motion vector $u^{t,t+1}(x)$, possibly has a fractional position, located in between four pixels, as shown in Fig. 5(a). When searching for the correspondence of $x + u^{t,t+1}(x)$ in frame $t+2$ or back in frame $t$, the motion vector for $x + u^{t,t+1}(x)$ has to be estimated, generally by spatial interpolation.

Here we propose a simple yet effective way to improve interpolation accuracy. We observe in our experiments that simple distance-based interpolation, e.g., the bilinear or bicubic method, could produce erroneous results. Figure 5(a) shows one example in which the orange and green regions undergo motion in different directions. A point projected in between pixels $\{y_0, y_1, y_2, y_3\}$ has, after bilinear interpolation, near-zero motion magnitude, which is obviously inappropriate. This problem is quite common for sequences containing dynamic objects. We ameliorate it by incorporating color information to guide interpolation bilaterally together with the spatial distance, originating from spatial bilateral filtering (Tomasi and Manduchi 1998). The operator is written as

$$u^{t+1,\bar{t}}\!\left(x + u^{t,t+1}(x)\right) = \frac{1}{|w|} \sum_{i=0}^{3} u^{t+1,\bar{t}}(y_i)\; e^{-\left\|x + u^{t,t+1}(x) - y_i\right\|^2/\sigma_1^2 \,-\, \left(f_I^t(x) - f_I^{t+1}(y_i)\right)^2/\sigma_2^2}, \qquad (4)$$

where $\bar{t}$ can be $t$, $t+2$, or other frame indexes, depending on the motion definition. $|w|$ is the normalization factor. The term $\left(f_I^t(x) - f_I^{t+1}(y_i)\right)^2/\sigma_2^2$ considers the brightness similarity of points in different frames. $\sigma_1$ and $\sigma_2$ are set to 0.4 and 0.3 respectively. The comparison of bilateral interpolation and standard bilinear interpolation is given in Fig. 5(b).

Fig. 5 (a) Illustration of a flow interpolation issue. (b) Interpolation results on two patches. For each of them, bilateral and bilinear interpolation results are shown on top and bottom respectively. (c)–(d) $u^{t,t-5}$ and $u^{t,t+5}$ based on our trajectories. Occlusion is labeled as black. (e) $u^{t,t+5}$ based on our trajectories, produced without considering the long-range motion $u_l^{t,t+2}(x)$ and $u_l^{t,t+3}(x)$. Errors are larger than those in (d)

In the two motion field patches that contain dynamic object boundaries, it is clear that the bilateral method produces much sharper motion boundaries.
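A per-pixel sketch of the bilateral interpolation in Eq. (4) is given below, using the σ values quoted above. Row-major numpy arrays, the (column, row) coordinate convention, and the boundary handling are implementation assumptions.

```python
import numpy as np

def bilateral_flow_interp(flow_back, intensity_t, intensity_t1, x, u_fwd,
                          sigma1=0.4, sigma2=0.3):
    """Bilaterally interpolate the backward flow at the sub-pixel position
    x + u_fwd, weighting the four surrounding pixels by spatial distance and
    by brightness similarity to the source pixel x (sketch of Eq. (4)).
    flow_back: H x W x 2 backward flow; intensity_*: H x W arrays;
    x = (col, row) integer pixel; u_fwd = (dx, dy) forward flow at x."""
    px, py = x[0] + u_fwd[0], x[1] + u_fwd[1]          # sub-pixel target
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    num, wsum = np.zeros(2), 0.0
    for yi in ((x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)):
        if not (0 <= yi[1] < flow_back.shape[0] and 0 <= yi[0] < flow_back.shape[1]):
            continue
        d_sp = (px - yi[0]) ** 2 + (py - yi[1]) ** 2    # spatial distance
        d_br = (intensity_t[x[1], x[0]] - intensity_t1[yi[1], yi[0]]) ** 2
        w = np.exp(-d_sp / sigma1 ** 2 - d_br / sigma2 ** 2)
        num += w * flow_back[yi[1], yi[0]]
        wsum += w
    return num / wsum if wsum > 0 else np.full(2, np.nan)
```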

With this interpolation scheme, we link corresponding points among frames, which forms motion trajectories. Due to inevitable estimation errors in occlusion regions, object boundaries, and textureless regions, we identify and exclude outliers with bidirectional flow vectors (Huguet and Devernay 2007). In particular, we project $x + u^{t,t+1}(x)$ in $f^{t+1}$, which is mapped from $x$ in $f^t$ based on the motion vector $u^{t,t+1}(x)$, back to $f^t$. We sum the two vectors with opposite directions and define the map $o_u$, to mark glaring errors, as

$$o_u(x) = \begin{cases} 0 & \left|u_o^{t,t+1}\right| \geq \tau \\ 1 & \text{otherwise} \end{cases} \qquad (5)$$

where

$$u_o^{t,t+1} = u^{t+1,t}\!\left(x + u^{t,t+1}(x)\right) + u^{t,t+1}(x),$$

and $\tau$ is the error threshold, set to 1. Satisfying the inequality $|u_o^{t,t+1}| \geq \tau$ means that the motion vectors that are supposed to be opposite are misaligned. In this case we discard $u^{t,t+1}(x)$ and set $o_u(x)$ to 0.
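The bidirectional check of Eq. (5) can be sketched as follows; it reuses the hypothetical bilateral_flow_interp sketch above, and the dense per-pixel loop is an assumed (unoptimized) implementation.

```python
import numpy as np

def forward_backward_check(u_fwd, u_bwd, intensity_t, intensity_t1, tau=1.0):
    """Mark each forward flow vector as reliable (1) or unreliable (0):
    the forward vector and the backward vector interpolated at its endpoint
    should roughly cancel out (Eq. (5))."""
    h, w = u_fwd.shape[:2]
    o_u = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            ub = bilateral_flow_interp(u_bwd, intensity_t, intensity_t1,
                                       (x, y), u_fwd[y, x])
            residual = ub + u_fwd[y, x]          # should be close to zero
            if np.all(np.isfinite(residual)) and np.linalg.norm(residual) < tau:
                o_u[y, x] = 1
    return o_u
```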

Removing a problematic flow vector shortens a motion trajectory. If it is too short, insufficient temporal information results. Observing that many outliers are caused by occasional noise and pixel color variation, which generally do not appear consistently across frames for the same pixel, we also utilize longer-range bidirectional flow vectors, i.e., $u_l^{t,t+2}(x)$ and $u_l^{t+2,t}(x)$, as illustrated in Fig. 4. If they are valid after going through the same bidirectional consistency check, we reconnect the trajectory from frame $t$ to $t+2$. Otherwise, we continue to test the pair $u_l^{t,t+3}(x)$ and $u_l^{t+3,t}(x)$. Only if all three checks fail do we break the trajectory. The procedure to build a motion trajectory is detailed in Algorithm 2.

With these trajectories built in the forward and backward directions, we can find a series of correspondences in other frames for each pixel $x_t$ in frame $t$. The motion vector between $x_t$ and its correspondence in frame $t \pm i$ is expressed as $u^{t,t\pm i}$, which is the sum of all consecutive motion vectors in the trajectory from frame $t$ to $t \pm i$ for $x_t$.

Figure 5(c)–(d) show motion fields $u^{t,t-5}$ and $u^{t,t+5}$, respectively, produced by Algorithm 2. Unreliable matches are marked as black (NaN in Algorithm 2). They are typically caused by motion occlusion, leading to breaks of motion trajectories. This type of disconnection is, however, desirable because occlusion shortens trajectories by nature. Results in (c)–(d) also demonstrate that a pixel is occluded along at most one direction, but not both.


Algorithm 2 Robust Trajectory Building
INPUT: $\{u^{t,t\pm 1}\}$, $\{u^{t,t\pm 2}\}$, $\{u^{t,t\pm 3}\}$
for i = 2 to n do
    for coordinate x do
        if $o_u(u^{t\pm(i-1),t\pm i}, u^{t\pm i,t\pm(i-1)}, x)$ then
            $u^{t,t\pm i}(x) = u^{t\pm(i-1),t\pm i}(x + u^{t,t\pm(i-1)}(x)) + u^{t,t\pm(i-1)}(x)$.
        else if $o_u(u^{t\pm(i-2),t\pm i}, u^{t\pm i,t\pm(i-2)}, x)$ then
            $u^{t,t\pm i}(x) = u^{t\pm(i-2),t\pm i}(x + u^{t,t\pm(i-2)}(x)) + u^{t,t\pm(i-2)}(x)$.
        else if $o_u(u^{t\pm(i-3),t\pm i}, u^{t\pm i,t\pm(i-3)}, x)$ and i > 2 then
            $u^{t,t\pm i}(x) = u^{t\pm(i-3),t\pm i}(x + u^{t,t\pm(i-3)}(x)) + u^{t,t\pm(i-3)}(x)$.
        else
            $u^{t,t\pm i}(x)$ = NaN.
        end if
    end for
end for
OUTPUT: $\{u^{t,t\pm l}\}$, $l \in \{1, 2, \ldots, n\}$
Note: NaN represents invalid motion vectors.
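For illustration, the forward half of Algorithm 2 can be sketched in Python as below. Nearest-neighbour lookup stands in for the bilateral interpolation of Eq. (4), the dictionary layout of the flow fields is an assumption, and the unvectorized loops are only for clarity.

```python
import numpy as np

def sample(field, pos):
    """Nearest-neighbour lookup at a (possibly sub-pixel) position; the paper
    interpolates bilaterally (Eq. (4)), nearest is used here for brevity."""
    x, y = int(round(pos[0])), int(round(pos[1]))
    if 0 <= y < field.shape[0] and 0 <= x < field.shape[1]:
        return field[y, x]
    return np.nan if field.ndim == 2 else np.full(field.shape[-1], np.nan)

def build_forward_trajectories(flows, valid, n):
    """Forward half of Algorithm 2.  flows[(a, b)] is the flow field u^{a,b}
    (H x W x 2) for b - a in {1, 2, 3}; valid[(a, b)] is its bidirectional
    check map (1 = reliable).  Returns traj[i], the accumulated motion
    u^{0,i} per pixel, with NaN marking broken trajectories."""
    h, w = flows[(0, 1)].shape[:2]
    traj = {0: np.zeros((h, w, 2))}
    for i in range(1, n + 1):
        traj[i] = np.full((h, w, 2), np.nan)
        for y in range(h):
            for x in range(w):
                for step in (1, 2, 3):           # try the shortest link first
                    s = i - step
                    if s < 0 or (s, i) not in flows:
                        continue
                    base = traj[s][y, x]          # accumulated u^{0,s}
                    if np.any(np.isnan(base)):
                        continue
                    pos = np.array([x, y], float) + base
                    if sample(valid[(s, i)], pos) == 1:
                        traj[i][y, x] = base + sample(flows[(s, i)], pos)
                        break
    return traj
```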

To demonstrate the effectiveness of the longer-range bidirectional flows $u_l^{t,t+2}(x)$ and $u_l^{t,t+3}(x)$ in generating trajectories, we show a comparison in Fig. 5(d) and (e), where (e) is produced with trajectory construction that does not use long-range flow and relies only on $u_l^{t,t+1}$. Fewer black pixels appear in the flow field (d) due to higher robustness against occasional noise and estimation outliers.

Trajectory-Based Structure Profile To build a practical system with constraints applied across multiple frames, besides trajectories, we also base our computation on a central observation: motion and disparity boundaries are mostly in line with image structure boundaries. Salient edge maps, in this regard, are useful in shaping motion and disparity.

Unfortunately, depth and flow edges are very sensitive to noise, blurriness, illumination variation, and other kinds of image degradation. When taking sequences into consideration, boundaries of dynamic objects vary over time and are usually composed of different sets of pixels. It is common knowledge that using image gradient information for each frame separately can hardly infer reliable and consistent edges, as illustrated in Fig. 6(a).

Our goal in this part is to compute a series of salient structures that are temporally consistent and spatially conspicuous, regardless of whether they lie on dynamic objects or along illuminance variation boundaries. To this end, we first calculate edge magnitude maps separately for each frame after bilateral smoothing to remove a small degree of noise. The magnitude operator is $\sqrt{f_{\partial h}^2 + f_{\partial v}^2}$. Then we compute an edge occurrence map $C_f^t$ by simply setting pixels whose magnitudes are smaller than a threshold (generally set to 0.01) to zero. It indicates the occurrence of significant structure edges. All maps in the input sequence, together with the computed dense motion trajectories, are used to establish a structure profile map for each frame.

Fig. 6 Illustration of structure profiles. (a) Single-image edge extraction. The edges are weak and inconsistent in the two frames. (b) Our structure profile, which is temporally more consistent

Fig. 7 Trajectory-based structure profile construction

For each pixel $x$ in frame $t$, we project the edge occurrence values from all other frames onto it, according to the correspondences along the trajectories $\{u^{t,t+i}\}$ and $\{u^{t,t-i}\}$, as illustrated in Fig. 7. In this process, consistent edge occurrence values are found, and errors are quickly suppressed after a voting-like process.

The corresponding point of $x$ in frame $t+i$ is $x + u^{t,t+i}(x)$ after chain projection, where $u^{t,t+i}(x)$ is the overall motion vector. The average of the occurrence values along the trajectory is expressed as

$$\tilde{C}_f^t(x) = \frac{1}{n} \sum_i C_f^{t+i}\!\left(x + u^{t,t+i}(x)\right), \qquad (6)$$


where $n$ is the number of corresponding pixels along the trajectory. The structure profile $\tilde{C}_f^t(x)$ embodies statistics of the occurrence of strong edges over multiple frames. Its value reveals the chance that the current pixel is on a consistent edge.

We do not introduce weights in Eq. (6) because true edges can typically exist in a large number of frames consistently while a false one caused by noise and estimation errors does not. By projecting all correspondences to the current pixel and adding their occurrence values, a true edge point can gather a large confidence value. Occasional outliers cannot receive consistent support temporally, and therefore gather only small confidence. In this voting-like process, originally weak but consistent edges can be properly enhanced.

The resulting edge occurrence maps are $\{\tilde{C}_f^t\}$, which are used to define edge priors. Figure 6(b) shows two edge occurrence maps, where inconsistent edges are notably weakened and consistent ones are enhanced. This profile construction process is robust against noise and sudden illumination change.
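A minimal sketch of the per-frame edge occurrence map and the trajectory averaging of Eq. (6) is given below. It reuses the hypothetical trajectory dictionary from the chaining sketch above; the pre-smoothing step and interpolation are simplified assumptions.

```python
import numpy as np

def edge_occurrence(intensity, thresh=0.01):
    """Per-frame edge occurrence map C_f: gradient magnitude with
    sub-threshold responses set to zero (pre-smoothing omitted here)."""
    gy, gx = np.gradient(intensity)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    mag[mag < thresh] = 0.0
    return mag

def structure_profile(occ_maps, traj, y, x):
    """Average edge occurrence values projected along a trajectory (Eq. (6)).
    occ_maps[i] is C_f^{t+i}; traj[i] is the accumulated motion u^{t,t+i}
    per pixel (NaN where the trajectory is broken)."""
    vals = []
    for i, occ in occ_maps.items():
        u = traj[i][y, x]
        if np.any(np.isnan(u)):
            continue
        px, py = int(round(x + u[0])), int(round(y + u[1]))
        if 0 <= py < occ.shape[0] and 0 <= px < occ.shape[1]:
            vals.append(occ[py, px])
    return np.mean(vals) if vals else 0.0
```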

Trajectory-Based Depth/Motion Profile Another set of important profiles is constructed based on the fact that $\tilde{C}_f$ does not contain scene flow information. Spatial-temporal constraints were proposed in Black (1994), Bruhn and Weickert (2005), and Bruhn et al. (2005) to enforce temporal consistency of motion vectors. Locally constant speed (Bruhn and Weickert 2005; Bruhn et al. 2005) and constant acceleration (Black 1994) are typical assumptions. We use a temporally linear model to fit motion and depth, in order to reject outliers while allowing for depth and motion variation.

Here, we describe our depth profile estimation procedure; the motion profile can be computed similarly. For each pixel, we adopt a linear parametric model to fit depth after temporally projecting values from other frames to $t$ based on our trajectories. The linear model is controlled by two parameters $w_0$ and $w_1$, representing depth offset and slope. Regression minimizes the energy

$$\sum_i \gamma_i(x) \left( w_1(x)\, i + w_0(x) - \frac{1}{d^{t+i}} \right)^2, \qquad (7)$$

where $i$ indexes frames in the trajectory, $\gamma_i(x)$ is the weight for the $(t+i)$-th frame, and $1/d^{t+i}$ is the corresponding depth for $x$ in frame $t+i$. For sub-pixel point positions, we interpolate depth using the bilateral weights described in Eq. (4). $\gamma_i(x)$ plays an important role and is defined as

$$\gamma_i(x) = e^{-i^2/\sigma_t^2} \cdot o_d(x),$$

which embodies two parts: the temporal weight $e^{-i^2/\sigma_t^2}$, a Gaussian window that reduces the influence of frames far from the current frame $t$ (with $\sigma_t = 10$), and the depth occlusion $o_d(x)$, labeled using the uniqueness check (Scharstein and Szeliski 2002). A zero $o_d(x)$ indicates occlusion, which correspondingly decreases the weight $\gamma_i(x)$ to zero.

Before regression in Eq. (7), we check the sum $\sum_i \gamma_i$. If $\sum_i \gamma_i < 3$, we skip the regression for robustness.

Equation (7) provides an effective way to gather statistical disparity information from multiple frames without much occlusion influence. As the initial disparity values are optimized in a global fashion in each frame, the robust regression process has actually incorporated neighboring disparity information for each pixel.

In the minimization, taking derivatives w.r.t. the parameters $w_0(x)$ and $w_1(x)$ and setting them to zero yields two equations. After a few simple algebraic operations, closed-form solutions are obtained as

$$w_0(x) = \frac{\sum_i \gamma_i(x)\, i \cdot \sum_i \frac{\gamma_i(x)\, i}{d^{t+i}} \;-\; \sum_i \gamma_i(x)\, i^2 \cdot \sum_i \frac{\gamma_i(x)}{d^{t+i}}}{\left(\sum_i \gamma_i(x)\, i\right)^2 - \sum_i \gamma_i(x)\, i^2 \cdot \sum_i \gamma_i(x)}, \qquad (8)$$

$$w_1(x) = \frac{\sum_i \gamma_i(x)\, i \cdot \sum_i \frac{\gamma_i(x)}{d^{t+i}} \;-\; \sum_i \gamma_i(x) \cdot \sum_i \frac{\gamma_i(x)\, i}{d^{t+i}}}{\left(\sum_i \gamma_i(x)\, i\right)^2 - \sum_i \gamma_i(x)\, i^2 \cdot \sum_i \gamma_i(x)}. \qquad (9)$$

Finally, given the estimated linear parameters, the depth profile $\tilde{d}^t(x)$ for $x$ in frame $t$ is written as

$$\tilde{d}^t(x) = \frac{1}{w_0(x)}. \qquad (10)$$

Fig. 8 Computed $w_0$ and $w_1$ maps for one frame

We show the $w_0$ and $w_1$ results in Fig. 8. $w_0$ corresponds to scene depth. It is made temporally more consistent by the regression, which rejects random outliers and preserves boundaries. The average magnitude of the $w_1$ map is much smaller than that of $w_0$, manifesting that depth does not undergo abrupt change over time.
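The closed-form weighted fit of Eqs. (7)–(10) for a single pixel can be sketched as below; the data layout (dictionaries keyed by frame offset) is an assumption made for illustration.

```python
import numpy as np

def fit_depth_profile(inv_depths, gammas, min_weight=3.0):
    """Weighted linear fit of reciprocal disparity along a trajectory
    (Eqs. (7)-(10)).  inv_depths[i] = 1/d^{t+i}, gammas[i] = gamma_i(x),
    with possibly negative offsets i.  Returns (d_tilde, w0, w1), or None
    when the accumulated weight is below the threshold (regression skipped)."""
    keys = sorted(inv_depths)
    i = np.array(keys, dtype=float)
    z = np.array([inv_depths[k] for k in keys])    # 1/d^{t+i} samples
    g = np.array([gammas[k] for k in keys])
    if g.sum() < min_weight:
        return None                                 # skip for robustness
    a = np.sum(g * i * i)                           # sum gamma_i * i^2
    b = np.sum(g * i)                               # sum gamma_i * i
    c = np.sum(g)                                   # sum gamma_i
    p = np.sum(g * i * z)                           # sum gamma_i * i / d
    q = np.sum(g * z)                               # sum gamma_i / d
    det = a * c - b * b
    w0 = (a * q - b * p) / det                      # offset: value at i = 0
    w1 = (c * p - b * q) / det                      # slope over time
    return 1.0 / w0, w0, w1                         # d_tilde = 1 / w0, Eq. (10)
```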

For motion profile computation, we similarly apply the linear model, yielding

$$\tilde{u}^t(x) = \frac{\sum_i \gamma_i(x)\, i \cdot \sum_i \gamma_i(x)\, i\, u^{t+i} \;-\; \sum_i \gamma_i(x)\, i^2 \cdot \sum_i \gamma_i(x)\, u^{t+i}}{\left(\sum_i \gamma_i(x)\, i\right)^2 - \sum_i \gamma_i(x)\, i^2 \cdot \sum_i \gamma_i(x)}, \qquad (11)$$

where $\tilde{u}^t(x)$ is the motion profile for pixel $x$ in frame $t$ and $\gamma_i(x) = e^{-i^2/\sigma_t^2} \cdot o_u(x)$.


$o_u$ is the motion occlusion estimated through the bidirectional check. As shown in Fig. 5(c)–(d), occlusion generally arises along one direction, either forward or backward, but not both. So the situation that $o_u(x) = 0$ for all correspondences of one pixel in the trajectory seldom occurs, making it almost always possible to find points to gather statistics and refine motion during regression. We adopt a small $\sigma_t$ (set to 3) because motion variation is typically larger than the change of scene depth.

Algorithm 3 Depth and Scene Flow Computation
INPUT: depth profiles $\{\tilde{d}^t\}$, motion profiles $\{\tilde{u}^{t,t+1}\}$, sequences $\{f_{l,r}^0, f_{l,r}^1, \ldots, f_{l,r}^{N-1}\}$
Update depth $\{\tilde{d}^t\}$ to $\{d^t\}$ for all frames with the temporal constraint (Sect. 4.3.1).
for frame i = 0 to N − 1 do
    Optimize scene flow $s^{i,i+1}$ based on $(d^i, d^{i+1}, \tilde{u}^{i,i+1})$ (Sect. 4.3.2).
end for
OUTPUT: temporally consistent disparity maps $\{d^t\}$ and scene flow $\{s^{t,t+1}\}$

4.3 Temporally-Constrained Depth and Scene Flow

In this section, we describe the central steps to estimate depth and image scene flow given the temporal constraints. We find that estimating depth and scene flow in the same pass is computationally expensive (Wedel et al. 2011) and unstable, because depth and scene flow are completely different variables by nature. Disparity can have very large values (up to tens or hundreds of pixels), while scene flow captures object position variation and thus has a much smaller scale. Optimizing them together makes variational optimization difficult to perform satisfactorily and prone to local optima. As reported in Huguet and Devernay (2007) and Wedel et al. (2011), a full joint procedure to estimate depth and scene flow takes a very long time.

It is also notable that the initial disparity values estimated frame-by-frame lack sub-pixel accuracy, making them unsuitable for $\delta d^{t,t+1}$ estimation in scene flow. With these concerns, we decouple depth and scene flow, and optimize the disparity sequence with the long-range temporal constraint. Scene flow is then updated. The algorithm is outlined in Algorithm 3. We describe below the spatio-temporal functions to constrain depth and scene flow.

4.3.1 Consistent Depth Estimation

To refine depth, we minimize

$$E_f(d^t) = \int_\Omega \left( \hat{o}_d(x)\, E_F(d^t) + \alpha_d E_T(d^t) + \beta_d E_S(\nabla d^t) \right) dx, \qquad (12)$$

where $\alpha_d$ and $\beta_d$ are two weights. $E_F(d^t)$ is the data cost defined in Eq. (1), which relates the two views at time $t$. The occlusion variable $\hat{o}_d(x)$ helps reduce the adverse influence of occasional occlusion. We define $\hat{o}_d(x) = \max(o_d(x), 0.01)$ for the sake of numerical stability. $E_T(d^t)$ is the temporal depth data term, defined as

$$E_T(d^t) = \left(d^t - \tilde{d}^t\right)^2, \qquad (13)$$

where $\tilde{d}^t$ is the fitted depth profile. This seemingly simple term is essential in our method because it incorporates long-range temporal information from multiple frames. $E_F$, on the contrary, is only a local frame-wise data term.

The structure profile is incorporated in depth regularization to enforce structure consistency among frames. We impose full smoothness only in regions with small edge-occurrence values in $\tilde{C}_f^t$ and allow depth discontinuities where $\tilde{C}_f^t(x)$ is large. On account of possible subpixel errors in averaging the edge-occurrence maps, edges in $\tilde{C}_f^t$ could be slightly wider than they should be, as shown in Fig. 6. It is inappropriate to naively enforce no or little smoothness for these pixels because, without necessary regularization, edges cannot be well preserved.

We turn to an anisotropic smoothness method (Xiao et al. 2006; Sun et al. 2008; Zimmer et al. 2009) to provide critical constraints for edge-preserving regularization. We decompose the depth gradient $\nabla d$ into $\{\nabla d_{\parallel}, \nabla d_{\perp}\}$ according to the image gradient, where

$$\nabla d_{\parallel} = \left\langle \nabla d,\, \overline{\nabla f_I} \right\rangle \overline{\nabla f_I}, \qquad \nabla d_{\perp} = \nabla d - \nabla d_{\parallel},$$

and $\overline{\nabla f_I}$ is the unit vector parallel to the frame intensity gradient $\nabla f_I$, i.e., $\overline{\nabla f_I} = \frac{\nabla f_I}{\|\nabla f_I\|}$. We propose the following function for anisotropic smoothness regularization:

$$E_S(\nabla d) = \Gamma\!\left( \left\|\nabla d_{\perp}\right\|^2 + \left(1 - \tilde{C}_f\right)\left\|\nabla d_{\parallel}\right\|^2 \right), \qquad (14)$$

where $\Gamma(\cdot)$ is the robust Charbonnier function defined in Eq. (1). In Eq. (14), for all pixels, smoothness is enforced along the isophote direction, while discontinuity along the gradient is allowed only for reliable strong edges, which correspond to large $\tilde{C}_f$. Hence, Eq. (14) provides necessary constraints in different directions. We note that strong edges caused by occasional outliers in one frame would not affect the anisotropic regularizer. On the other hand, compared with $C_f$, temporally consistent edges have been enhanced in $\tilde{C}_f$, which boosts the anisotropic properties accordingly.

With a few algebraic operations, Eq. (14) can be written in the form of a diffusion tensor,

$$E_S(\nabla d) = \Gamma\!\left( \nabla d^{T} D(\nabla f_I)\, \nabla d \right), \qquad (15)$$

where $D(\nabla f_I)$ is the diffusion tensor defined as

$$D(\nabla f_I) = \overline{\nabla f_I}^{\perp} \left(\overline{\nabla f_I}^{\perp}\right)^{T} + \left(1 - \tilde{C}_f\right) \overline{\nabla f_I}\; \overline{\nabla f_I}^{T},$$

where $\overline{\nabla f_I}$ and $\overline{\nabla f_I}^{\perp}$ are the two unit vectors parallel and perpendicular to $\nabla f_I$, respectively. Equation (12) can be efficiently minimized using variational solvers to enable sub-pixel accuracy. Note that the inherent difficulty of solving for large displacements in the variational framework is greatly reduced with our initial estimate $\tilde{d}^t$, obtained by robust regression. Our energy minimization is discussed in Sect. 4.3.3.

Fig. 9 Effectiveness of anisotropic regularization. (a) Our depth estimate $\{d^t\}$. (b) Top to bottom and left to right: patch of the input image $f^t$, $\tilde{C}_f^t$, and depth results using the isotropic and anisotropic regularization terms, both guided by $\tilde{C}_f^t$

A depth map result is shown in Fig. 9(a), with the comparison in (b). The bottom-left subfigure of (b) is the depth map obtained by enforcing smoothness uniformly in all directions with strength $(1 - \tilde{C}_f^t)$. When $\tilde{C}_f^t$ is large near the boundaries, the depth estimation is ill-posed, resulting in a problematic map. Our result with anisotropic regularization preserves edges much better.
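The per-pixel diffusion tensor of Eq. (15) and the corresponding Charbonnier penalty can be sketched as below; array layout, the gradient-normalization epsilon, and the Charbonnier epsilon are assumptions.

```python
import numpy as np

def diffusion_tensor(grad_fI, c_tilde, eps=1e-8):
    """Per-pixel anisotropic diffusion tensor D(grad f_I): unit smoothing
    across the isophote direction, smoothing scaled by (1 - C~_f) along the
    image gradient.  grad_fI: H x W x 2, c_tilde: H x W in [0, 1].
    Returns an H x W x 2 x 2 tensor field."""
    norm = np.linalg.norm(grad_fI, axis=-1, keepdims=True) + eps
    n = grad_fI / norm                                    # unit gradient direction
    n_perp = np.stack([-n[..., 1], n[..., 0]], axis=-1)   # unit isophote direction
    outer = lambda v: v[..., :, None] * v[..., None, :]   # per-pixel v v^T
    return outer(n_perp) + (1.0 - c_tilde)[..., None, None] * outer(n)

def anisotropic_penalty(grad_d, D, eps=1e-3):
    """Charbonnier penalty Gamma(grad_d^T D grad_d) per pixel (Eq. (15))."""
    quad = np.einsum('...i,...ij,...j->...', grad_d, D, grad_d)
    return np.sqrt(quad + eps ** 2)
```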

4.3.2 Scene Flow Estimation

We now estimate 2D motion and depth variation with the necessary temporal constraints for image scene flow.

Data Fidelity Term Our new data cost is given by

$$E_D\!\left(u^{t,t+1}, \delta d^{t,t+1}\right) = \hat{o}_d \hat{o}_u E_B\!\left(u^{t,t+1}, \delta d^{t,t+1}\right) + \hat{o}_u E_L\!\left(u^{t,t+1}\right) + \hat{o}_d \hat{o}_u E_R\!\left(u^{t,t+1}, \delta d^{t,t+1}\right) + \alpha_u \hat{o}_u E_{Td}\!\left(\delta d^{t,t+1}\right) + \alpha_u E_{Tu}\!\left(u^{t,t+1}\right), \qquad (16)$$

where $E_B$, $E_L$ and $E_R$ are the traditional scene flow constraints defined in Eq. (1). $\{d^t\}$ is the disparity computed in the preceding estimation step (described in Sect. 4.3.1). $\hat{o}_u$, where $\hat{o}_u = \max(o_u, 0.01)$, and $\hat{o}_d$ mask out unreliable depth and flow primarily caused by occlusion. $\alpha_u$ is the weight for the long-range temporal constraint.

$E_{Td}$ and $E_{Tu}$ incorporate our new temporal profiles. $E_{Td}$ is defined as

$$E_{Td}\!\left(\delta d^{t,t+1}\right) = \left(\delta d^{t,t+1} + d^t - d^{t+1}\right)^2, \qquad (17)$$

where $d^t$ is a shorthand for $d^t(x)$ and $d^{t+1} := d^{t+1}(x + u^{t,t+1}(x))$. They are the disparities of $x$ in the $t$-th and $(t+1)$-th frames, respectively. Our depth refinement makes $\delta d^{t,t+1}$ a sub-pixel value.

Similarly, the flow constraint is defined as

$$E_{Tu}\!\left(u^{t,t+1}\right) = \left\|u^{t,t+1} - \tilde{u}^{t,t+1}\right\|^2. \qquad (18)$$
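A per-pixel sketch of the combined data cost of Eqs. (16)–(18) is given below; the precomputed Charbonnier data terms eb, el, er and the placeholder weight alpha_u are assumptions made for illustration.

```python
import numpy as np

def scene_flow_data_cost(eb, el, er, d_t, d_t1, delta_d, u, u_tilde,
                         o_d_hat, o_u_hat, alpha_u=1.0):
    """Per-pixel data cost of Eq. (16): occlusion-masked scene flow terms
    E_B, E_L, E_R (precomputed scalars from Eq. (1)), the temporal disparity
    term E_Td (Eq. (17)), and the temporal flow term E_Tu (Eq. (18)).
    u and u_tilde are 2-vectors; the remaining inputs are scalars."""
    e_td = (delta_d + d_t - d_t1) ** 2            # Eq. (17)
    e_tu = np.sum((u - u_tilde) ** 2)             # Eq. (18)
    return (o_d_hat * o_u_hat * (eb + er) + o_u_hat * el
            + alpha_u * o_u_hat * e_td + alpha_u * e_tu)
```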

Smoothness Term In regularization, we penalize sudden and significant changes of the 2D motion $\nabla u^{t,t+1}$ and of the disparity variation $\nabla \delta d^{t,t+1}$ w.r.t. the frame diffusion tensor, yielding an anisotropic smoothing effect with the edge maps as guidance. We define

$$E_S\!\left(\nabla u^{t,t+1}, \nabla \delta d^{t,t+1}\right) = \Gamma\!\left( \left(\nabla u^{t,t+1}\right)^{T} D(\nabla f_I)\, \nabla u^{t,t+1} \right) + \kappa\, \Gamma\!\left( \left(\nabla \delta d^{t,t+1}\right)^{T} D(\nabla f_I)\, \nabla \delta d^{t,t+1} \right). \qquad (19)$$

Here $\kappa$ is a weight set to 0.5. The final objective function for scene flow estimation with temporal constraints is given by

$$E_f\!\left(u^{t,t+1}, \delta d^{t,t+1}\right) = \int_\Omega \left( E_D\!\left(u^{t,t+1}, \delta d^{t,t+1}\right) + \beta_u E_S\!\left(\nabla u^{t,t+1}, \nabla \delta d^{t,t+1}\right) \right) dx. \qquad (20)$$

We minimize it using the variational method detailed below.

4.3.3 Energy Minimization

Equations (12) and (20) can be solved in a coarse-to-fine framework using variational solvers. However, we find this procedure unnecessary, because in our system the variables are well initialized and estimated before optimization in the two stages. To compute depth using Eq. (12), the temporal depth profile is available. While jointly optimizing the elements of scene flow using Eq. (20), the updated depth and motion profiles provide good initialization. Moreover, energy minimization with proper initialization ameliorates the inherent estimation problems for large displacements (Brox et al. 2009; Xu et al. 2010), even at the original image resolution.

With this consideration, we perform a Taylor series expansion on the data terms and solve for the increments

$$\Delta d = d - d^{(0)}, \qquad \Delta s = s - s^{(0)}.$$
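To illustrate the incremental scheme, here is a per-pixel sketch of one linearized update for the smoothness-free part of Eq. (12): the stereo residual is expanded to first order around the current estimate $d^{(0)}$ and the resulting quadratic (with the Charbonnier robustification dropped for brevity) is minimized in closed form. The simplifications and the weight value are assumptions, not the authors' solver.

```python
def linearized_depth_update(f_l, f_r, grad_fr, d0, d_tilde, o_d, alpha_d=1.0):
    """One increment Delta d = d - d^(0) for the smoothness-free part of
    Eq. (12).  f_l: left intensity at x; f_r and grad_fr: right intensity and
    its horizontal gradient sampled at x + d^(0); d_tilde: depth profile;
    o_d: occlusion label; alpha_d: placeholder weight."""
    o_hat = max(o_d, 0.01)                     # hat{o}_d for numerical stability
    r = f_r - f_l                              # photometric residual at d^(0)
    # local quadratic: o_hat*(r + grad_fr*dd)^2 + alpha_d*(d0 + dd - d_tilde)^2
    a = o_hat * grad_fr ** 2 + alpha_d
    b = o_hat * r * grad_fr + alpha_d * (d0 - d_tilde)
    return -b / a                              # minimizer of the local quadratic
```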
