The Study of Disparity Estimation Design for High Definition 3DTV Applications

National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

Doctoral Dissertation

The Study of Disparity Estimation Design for High Definition 3DTV Applications

Student: Yu-Cheng Tseng

Advisor: Prof. Tian-Sheuan Chang


A Dissertation

Submitted to the Department of Electronics Engineering and Institute of Electronics

College of Electrical and Computer Engineering

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

in

Electronics Engineering

August 2011

Hsinchu, Taiwan, Republic of China

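The Study of Disparity Estimation Design for High Definition 3DTV Applications

Student: Yu-Cheng Tseng

Advisor: Tian-Sheuan Chang

Ph.D. Program, Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University

Abstract

With the advent of 3DTV, people can gain a new visual experience from 3D video. 3D video can be captured by stereo cameras and, after processing by image processing techniques, can support multi-view and free-viewpoint 3DTV applications. In 3D video processing, disparity estimation is one of the most important techniques. It generates disparity maps of the captured scene, which can be used to synthesize virtual-view videos. The 3D Video Coding team of the Moving Picture Experts Group (MPEG) has proposed the current state-of-the-art disparity estimation algorithm. This algorithm produces high-quality disparity maps for 3DTV applications, but its graph-cut algorithm leads to high computational complexity and low parallelism. These problems are even more severe for high definition videos. To solve them, this dissertation first proposes a baseline disparity estimation algorithm that adopts the belief propagation algorithm to increase the parallelism of disparity estimation, together with the joint bilateral upsampling algorithm to reduce the frame size to be computed. The problems faced in its hardware design are solved by the proposed architectural design methods. Based on this baseline algorithm, we further propose a high-quality disparity estimation algorithm that improves the temporal consistency and occlusion problems and produces high-quality disparity maps. For the high-quality algorithm, we propose two fast disparity estimation algorithms suited to different implementation methods. For software programming, the proposed sparse-computation fast algorithm selects sparse pixels through temporal and spatial analysis and updates disparity values only for those sparse pixels, reducing the execution time to 62.9%. On the other hand, for VLSI design, the proposed hardware-efficient fast algorithm uses a new matching cost diffusion method to reduce the execution time to 57.2%, and greatly reduces the memory cost of the original algorithm to 0.00029%. Objective evaluation results show that, for virtual-view video synthesis, the proposed algorithms achieve quality close to that of the current state-of-the-art algorithm.

Finally, we simplify the hardware-efficient fast algorithm and propose a high-throughput architectural design. The hardware implementation results show that the proposed disparity estimation engine supports a disparity range of 128, generates HD1080p disparity maps for three views simultaneously, and reaches an output rate of 95 frames per second, that is, 75.64G pixel-disparities per second. In summary, the disparity estimation design proposed in this dissertation meets the requirements of high definition 3DTV applications.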

The Study of Disparity Estimation Design for High Definition 3DTV Applications

Student: Yu-Cheng Tseng

Advisor: Dr. Tian-Sheuan Chang

Department of Electronics Engineering & Institute of Electronics

National Chiao Tung University

ABSTRACT

With emerging 3DTVs, viewers can have a new visual experience from 3D videos, which can be captured by stereo cameras and further processed by image processing techniques for multi-view or free-viewpoint 3DTV applications. In 3D video processing, one of the most important techniques is disparity estimation, which generates disparity maps for synthesizing virtual-view videos. The state-of-the-art disparity estimation algorithm proposed by the MPEG 3D Video Coding team delivers high-quality disparity maps, but it suffers from high computational complexity and low parallelism due to its graph-cut algorithm, especially for high definition videos.

To address these problems, this dissertation first proposes a baseline disparity estimation algorithm that adopts the belief propagation algorithm to increase the parallelism of disparity estimation, and the joint bilateral upsampling algorithm to reduce the computational resolution. Their design challenges are solved by our proposed architectural design methods. Based on the baseline algorithm, we further propose a high-quality algorithm that improves the temporal consistency and occlusion problems and delivers high-quality disparity maps. To accelerate the high-quality algorithm, we propose two fast algorithms for different implementation methods. For the software implementation, the sparse-computation fast algorithm decreases the processed pixels in the spatial and temporal domains and reduces the execution time of the high-quality algorithm to 62.9%. On the other hand, for the hardware implementation, we propose the hardware-efficient fast algorithm, which reduces the execution time of the high-quality algorithm to 57.2% and decreases the memory cost of belief propagation to 0.00029% by the proposed cost diffusion method. The objective evaluation results show that our disparity quality is similar to that of the state-of-the-art algorithm for view synthesis applications.

Moreover, we further simplify the hardware-efficient algorithm and propose a high-throughput architectural design. The implementation results show that the proposed disparity estimation engine achieves a throughput of 95 frames/s for three-view HD1080p disparity maps with 128 disparity levels (i.e., 75.64G pixel-disparities/s). It satisfies the requirements of high definition 3DTV applications.

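Acknowledgments

From four years as an undergraduate to five years in graduate school, NCTU has given me much growth and many memories. With the completion of this doctoral dissertation, I would like to thank everyone. First and foremost, I thank my advisor, Professor Tian-Sheuan Chang (張添烜), who, from my junior-year project research through my doctoral studies, gave me patient guidance and advice on research methods, paper writing, and paper submission. Next, I thank Professor 王聖智, who supervised 嘉賓 and me in a research project on color blindness during my senior year, recommended me for direct admission to the doctoral program, and served on my doctoral oral defense committee. I also thank the other committee members, Professors 楊家輝, 李鎮宜, 杭學鳴, 蔣迪豪, 蔡宗漢, and 林嘉文, for taking the time to give me guidance. Lab 427 in Engineering Building IV is where I spent the most time at NCTU during my doctoral studies. There, I first thank my senior 張彥中, who taught me good research methods and attitude and led me to the research topic of this dissertation. I also thank 國龍, my longest-standing lab classmate, who shared research and life with me during five years in the lab and worked alongside me toward the doctoral degree. Next, I thank the senior members of the lab, 佑昆, 朝鐘, 君偉, 裕仁, 錦木, 國亘, 旻奇, 子筠, 嘉俊, 英澤, 得瑋, and 秈璟, for teaching me the basic concepts of hardware design and creating a harmonious atmosphere in the lab. Of course, I also thank my lab classmates 宗憲, 景竹, 瑋呈, and 瑋城; because of you, the lab was always full of laughter. In addition, I thank the junior lab members who worked with me: 之悠, 博淵, 政君, 孟維, 筱珊, 博雄, 奕君, 瑩蓉, 宥辰, 元歆, 英佑, 克嘉, 亮齊, and 孟勳. Finally, I thank my family and my girlfriend. From the decision to pursue a doctorate, through the preparation for the qualifying examination and the wait for journal paper reviews, to the writing and defense of this dissertation, your support and company along the way made it possible for me to obtain this degree. This dissertation is dedicated to everyone thanked above.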

List of Tables

Table II-1 Various match metrics for computing C0(x, y, d)
Table III-1 Comparison of memory cost in memory access approaches for the iteration count of 30
Table III-2 Logic cost comparison of PE architectures
Table III-3 Implementation results of various BP-based algorithms
Table III-4 Comparison of BF acceleration approaches in computational complexity and memory cost
Table III-5 Computational flow and analysis for a pixel in the integral histogram approach
Table III-6 Modified computational flow and analysis for a pixel in the integral histogram approach
Table III-7 Example implementation result of the proposed architecture
Table III-8 Comparison of hardware cost per frame
Table III-9 Previous VLSI implementations of bilateral filtering
Table III-10 Comparison of different implementations
Table IV-1 Simulation results with different sampling factors in Y-PSNR (dB)
Table IV-2 Comparison of execution time of HQ-DE and SC-DE algorithms
Table IV-3 Window sizes of filter-based processes in HQ-DE algorithm
Table IV-4 Comparison of memory requirement between BP-M and cost diffusion methods
Table IV-5 Window sizes of filter-based processes in HE-DE algorithm
Table V-1 Test sequences
Table V-2 Input and output views for 2-view configuration [71]
Table V-3 Input and output views for 3-view configuration [71]
Table V-4 Experiment setting in our evaluation
Table V-5 Average execution time of proposed algorithms on PC for one frame
Table V-6 Average execution time scaled to HD1080p resolution and disparity range of 128
Table V-7 Evaluation results of Y-PSNR for View0
Table V-8 Evaluation results of Y-PSNR for View8
Table V-9 Evaluation results of SSIM for View0
Table V-10 Evaluation results of SSIM for View8
Table V-11 Evaluation results of T_PSPNR (dB) for View0
Table V-12 Evaluation results of T_PSPNR for View8
Table VI-1 Estimated average external bandwidth for computing four disparity rows
Table VI-2 Performance of the proposed disparity estimation engine
Table VI-3 Internal SRAM usage in the proposed disparity estimation engine
Table VI-4 Internal registers in the proposed disparity estimation engine
Table VI-5 Area of the computational logic
Table VI-6 Comparison of our design and previous implementation
Table VI-7 Evaluation results of Y-PSNR for View0
Table VI-8 Evaluation results of Y-PSNR for View8
Table VI-10 Evaluation results of SSIM for View8
Table VI-11 Evaluation results of T_PSPNR (dB) for View0
Table VI-12 Evaluation results of T_PSPNR (dB) for View8

List of Figures

Figure II-1 Epipolar geometry
Figure II-2 Image planes with rectification
Figure II-3 Relation between disparity and depth for a pair of correspondences
Figure II-4 A general framework for disparity estimation algorithms
Figure II-5 Matching costs of a target pixel and its correspondence candidates
Figure II-6 Illustration of a cost cube
Figure II-7 Block-based matching cost with the block radius r
Figure II-8 Various cost aggregation approaches
Figure II-9 Concept of dynamic programming approach
Figure II-10 Graph model of graph-cut algorithm
Figure II-11 Graph model of belief propagation approach
Figure II-12 General flow of view synthesis
Figure II-13 Warping methods in view synthesis
Figure II-14 Blending step in view synthesis
Figure II-15 Input and output view configuration defined by the 3DVC
Figure II-16 Flow of the DERS algorithm
Figure II-17 Data flow for 3-view configuration
Figure II-18 Example of temporal noise changing successive frames [76]
Figure II-19 Example of block matching in the DERS algorithm
Figure III-1 Illustrations of BP
Figure III-2 Configuration of the message passing PEs
Figure III-3 Traditional fixed memory access approach in a 1-D node line for node n3 computation
Figure III-4 Proposed spinning-message approach
Figure III-5 Proposed spinning-message approach in a 2-D node plane for node n3 computation
Figure III-6 Comparison of memory access approaches in different node planes
Figure III-7 Sliding node plane in different directions
Figure III-8 Sliding node plane with the spinning-message approach
Figure III-9 Bipartite node plane with the spinning-message approach
Figure III-10 Proposed sliding-bipartite node plane
Figure III-11 Pseudo code of the message passing for calculating a new message
Figure III-12 Architecture of Park's PE
Figure III-13 Proposed architecture
Figure III-14 Ratio of memory cost in different node planes with spinning-message approach
Figure III-15 Classification of BF acceleration approaches
Figure III-16 Concept of histogram-based approaches
Figure III-17 Concept of integral histogram approach
Figure III-19 Stripe-based method (SBM)
Figure III-20 Sliding origin method (SOM)
Figure III-21 Proposed architecture of JBF
Figure III-22 Schedule of the proposed architecture
Figure III-23 Selected-bin adder in the histogram calculation engines
Figure III-24 Proposed architectures of histogram calculation engines hic and hcc
Figure III-25 Proposed architecture of (a) convolution engine and (b) its table selection modules
Figure III-26 Flow of the proposed baseline disparity estimation algorithm
Figure III-27 Experimental results of the baseline algorithm and the DERS algorithm
Figure III-28 Center disparity maps and synthesized View8 of baseline algorithm at the 100th frame
Figure III-29 Center disparity maps and synthesized View8 of DERS algorithm at the 100th frame
Figure IV-1 Flow of the adaptive-BP algorithm [39]
Figure IV-2 Flow of the double-BP algorithm [40]
Figure IV-3 An example of flicker artifact of the baseline algorithm in BookArrival
Figure IV-4 An example of foreground copy artifact of the DERS algorithm in BookArrival
Figure IV-5 An example of occlusion problem at the 44th frame of BookArrival
Figure IV-6 Flow of the HQ-DE algorithm for a center-view disparity map
Figure IV-7 Flow of the HQ-DE algorithm for a side view disparity map
Figure IV-8 Comparison of different sampling factors in the average Y-PSNR of two frames
Figure IV-9 Simulation results using the sampling factors of 1/2×1/4 and 1/4×1/4
Figure IV-10 Illustration of downsampled disparity estimation for full disparity range
Figure IV-11 Comparison between the original regional vote [6] and the proposed window vote
Figure IV-12 Illustration of the proposed occlusion detection method
Figure IV-13 Results with and without the proposed occlusion handling method in BookArrival
Figure IV-14 Results of the HQ-DE algorithm in BookArrival compared to Figure IV-5
Figure IV-15 Concept of the proposed no-motion registration (NMR) method
Figure IV-16 Results of the proposed NMR method in BookArrival
Figure IV-17 Results of the proposed NMR method in the 32nd, 34th, 36th, 38th frames
Figure IV-18 Results of the proposed SEP method in BookArrival
Figure IV-19 Profiling of the HQ-DE algorithm on PC
Figure IV-20 Flow of the SC-DE algorithm for center-view disparity map
Figure IV-21 Flow of the SC-DE algorithm for side-view disparity maps
Figure IV-22 Flow of region detection for sparse pixel selection
Figure IV-23 Example of edge maps in BookArrival
Figure IV-24 Example of occlusion maps in BookArrival
Figure IV-25 Example of motion maps in BookArrival
Figure IV-26 Concept of sparse SSAD and sparse ADSW methods
Figure IV-27 Concept of sparse BP-M method
Figure IV-29 Image buffer required by the SSAD and ADSW steps
Figure IV-30 Flow of the HE-DE algorithm for center view
Figure IV-31 Concept of BP-M computation
Figure IV-32 Concept of the proposed window-based SSAD method
Figure IV-33 Flow of proposed occlusion handling method in HE-DE algorithm
Figure IV-34 Flow of edge detection and motion detection in HE-DE algorithm
Figure V-1 Clips of test sequences in center view
Figure V-2 Evaluation results of Y-PSNR
Figure V-3 Evaluation results of SSIM
Figure V-4 Evaluation results of T_PSPNR
Figure V-5 Disparity maps and view synthesized images in the 50th frame of BookArrival
Figure V-6 Disparity maps and view synthesized images in the 50th frame of LoveBird1
Figure V-7 Disparity maps and view synthesized images in the 100th frame of Newspaper
Figure V-8 Disparity maps and view synthesized images in the 50th frame of Café
Figure V-9 Disparity maps and view synthesized images in the 50th frame of Kendo
Figure V-10 Disparity maps and view synthesized images in the 100th frame of Balloons
Figure V-11 Disparity maps and view synthesized images in the 50th frame of Champagne
Figure V-12 Disparity maps and view synthesized images in the 50th frame of Pantomime
Figure V-13 Disparity maps and view synthesized images in the 50th frame of Hall1
Figure V-14 Disparity maps and view synthesized images in the 50th frame of Hall2
Figure V-15 Disparity maps and view synthesized images in the 167th frame of CarPark
Figure V-16 Disparity maps and view synthesized images in the 50th frame of CarPark
Figure VI-1 Data dependency of the HE-DE algorithm
Figure VI-2 Required row buffers in filter-based processes for pipelining architecture
Figure VI-3 Memory buffers in the motion detection
Figure VI-4 Flow of the proposed HW-DE algorithm
Figure VI-5 Proposed motion detection in the HW-DE algorithm
Figure VI-6 Overview architecture of the proposed disparity estimation engine
Figure VI-7 Proposed computational schedule for main core
Figure VI-8 Architecture of the low-resolution disparity estimation stage
Figure VI-9 Data access of the motion detection module in the frame coordinate system
Figure VI-10 Architecture of the motion detection module
Figure VI-11 Input and required data in matching cost calculation for three target views
Figure VI-12 Architecture of the window-based SSAD and DPotts modules
Figure VI-13 Architecture of the temporal cost calculation module
Figure VI-14 Architecture of vertical cost diffusion module
Figure VI-15 Fully parallel architecture of the horizontal cost diffusion module
Figure VI-16 Architecture of the horizontal cost diffusion module
Figure VI-18 Architecture of the disparity cross warping module
Figure VI-19 Architecture of the occlusion detection PE and the warp filling PE
Figure VI-20 Architecture of the good disparity detection module
Figure VI-21 Architecture of border filling and inside filling modules
Figure VI-22 Architecture of the high-resolution disparity estimation stage
Figure VI-23 Memory configuration in the high-resolution disparity estimation stage
Figure VI-24 Architecture of the joint bilateral upsampling module
Figure VI-25 Architecture of the window vote module
Figure VI-26 Architecture of the mask and vote PEs for the window vote module
Figure VI-27 Architecture of the still-edge preservation module
Figure VI-28 Rough schedule for external memory access
Figure VI-29 Architecture of external memory in our design
Figure VI-30 Read and write latency in the SDRAM model [110]
Figure VI-31 Data configuration in external memory
Figure VI-32 Schedule of external memory access for one HD1080p frame at 800MHz
Figure VI-33 Evaluation results of Y-PSNR
Figure VI-34 Evaluation results of SSIM

List of Symbols

Symbols    Descriptions

H, W    Frame height, frame width

DR    Disparity range

$I_{S_0,S_1}^{S_2}$    Image frame, where S0 is H for high resolution or L for low resolution; S1 is L for the left view, C for the center view, or R for the right view; and S2 is t for the current frame or t-1 for the previous frame

$D_{S_0,S_1}^{S_2}$    Disparity map, where S0 is H for high resolution or L for low resolution; S1 is L for the left view, C for the center view, or R for the right view; and S2 is t for the current frame or t-1 for the previous frame

C0    Initial cost cube computed by the matching cost calculation step

Caggr    Cost cube computed by the cost aggregation step

Cview    Cost for inter-view consistency constraint

Ctemp    Cost for temporal consistency enhancement

Cvert    Cost for vertical diffusion method

Ctotal    Final cost for disparity optimization

T    Iteration count in belief propagation

D(d)    Data term in disparity optimization process

I Introduction

1.1 Background

With the rapid development of 3-D display techniques, people can obtain a new visual experience from 3-D videos, which provide multiple views for the left and right eyes. Compared to traditional 2-D videos, 3-D videos give viewers a sense of distance in the scene, and they require additional video processes: calibration and rectification, multi-view video coding, disparity estimation, and virtual view synthesis. For these 3-D video processes, the Moving Picture Experts Group (MPEG) 3-D Video Coding (3DVC) group has delivered a basic 3DTV framework that consists of the depth estimation reference software (DERS) [63], the view synthesis reference software (VSRS) [64], and the Multi-view Video Coding (MVC) standard [107]. The group also provides multi-view video sequences [71] for performance evaluation. The basic 3DTV framework can be extended to various systems, such as stereoscopic TV for multiple viewers and free-viewpoint TV for a larger viewing zone [100], [101].

For the basic 3DTV framework, previous VLSI implementations of the VSRS and the MVC decoder [61], [62] reach real-time performance for high definition videos. The DERS, on the other hand, delivers high quality disparity maps but suffers from high computational complexity due to its graph-cut optimization, especially for high definition videos. Therefore, it is necessary to develop a disparity estimation engine that delivers high quality disparity maps and achieves real-time performance for high definition videos.

1.2 Motivation

Many disparity estimation algorithms have been developed in computer vision for different [...] accuracy evaluation [72] shows that the graph-cut and the belief propagation approaches perform better than other kinds of approaches. Based on the graph-cut approach, the state-of-the-art DERS algorithm delivered by MPEG 3DVC generates high quality disparity maps for 3DTV applications, but it still encounters the following problems. First, the temporal consistency problem is not addressed well because of the foreground copy artifact. Second, its execution time increases dramatically with increasing video resolution and disparity range; for one HD1080p frame, it takes more than 20 minutes on average on a personal computer. Third, the computation of graph-cut is irregular and iterative, so it is not suitable for acceleration by the parallel processing elements (PEs) of a VLSI design or a multi-core platform.

Motivated by these problems in the state-of-the-art disparity estimation algorithm, the goal of this dissertation is to develop a new disparity estimation engine that not only generates high quality disparity maps but also achieves a throughput of 60 frames/s at HD1080p resolution, to satisfy the requirements of high definition 3DTV applications.

1.3 Contribution

To achieve the above goal, this dissertation develops a disparity estimation engine from the algorithm level to the architectural design level. The main achievements of this dissertation include a baseline disparity estimation algorithm, an advanced disparity estimation algorithm, two fast algorithms for the advanced one, and a high-throughput disparity estimation design.

The contributions in each achievement are as follows. First, the baseline disparity estimation algorithm combines the belief propagation approach, which increases the computational parallelism of disparity estimation, with the joint bilateral upsampling approach, which decreases the computational space. In addition, we solve their memory cost problems by architectural design techniques. Second, based on the baseline algorithm, we propose the advanced disparity estimation algorithm that could [...] DERS algorithm. Third, we propose two fast disparity estimation algorithms that accelerate the high-quality algorithm by different strategies for different implementation methods. For the processor-based platform, the sparse-computation algorithm reduces the original execution time to 62.9% by reducing the processed pixels from a dense space to a sparse space. On the other hand, for the hardware design, the hardware-efficient algorithm reduces the original memory cost to 0.00029% by replacing the belief propagation with the proposed cost diffusion method. Finally, we propose a high-throughput disparity estimation engine for the hardware-efficient algorithm with a three-stage row-based pipelining architecture. The dedicated design achieves a throughput of 95 frames/s for HD1080p disparity maps of three views, using 1,645K gates and 59.4 Kbytes of memory.

In the objective quality evaluation, the experimental results show that our proposed advanced disparity estimation algorithm performs better than the DERS algorithm, especially in temporal consistency. In addition, the proposed fast algorithms perform similarly to the advanced algorithm, and the final hardware design shows slight quality degradation because of its simplification.

To sum up, the proposed disparity estimation design delivers disparity maps with high throughput and high quality to satisfy the requirements of high definition 3DTV applications.

1.4 Dissertation Organization

This dissertation is organized as follows. Chapter II introduces the general framework of a disparity estimation algorithm and the existing approaches to each step in the framework. Chapter III analyzes the algorithms and architectures of belief propagation and joint bilateral upsampling, and presents the baseline disparity estimation algorithm. To improve the quality and speed of the baseline algorithm, Chapter IV proposes the high-quality disparity estimation algorithm and its two fast algorithms: sparse-computation and hardware-efficient. Then, Chapter V compares the disparity [...] methods. With the hardware-efficient algorithm, Chapter VI proposes the architecture of the disparity estimation engine and demonstrates our implementation results. Finally, Chapter VII concludes this dissertation.

II Background

This chapter introduces the background of disparity estimation and its application to view synthesis. It is organized as follows. First, we present the concept of disparity estimation and review the existing disparity estimation algorithms. Then, we illustrate the view synthesis technique, depth-image-based rendering (DIBR), which is our target application of disparity estimation. Finally, we introduce the state-of-the-art disparity estimation algorithm [63] developed by MPEG 3-D Video Coding (3DVC) and point out its quality and design problems.

2.1 Disparity Estimation

In 3DTV applications, disparity estimation extracts the disparity information from source videos and generates a disparity map for each frame. The disparity map describes the relative distance of objects in the scene and can be further used to generate virtual-view videos. The approach to disparity estimation depends on the number of input video views. The 2-D to 3-D conversion approach is for traditional single-view videos, while the stereo correspondence approach is for two-view and multiple-view videos. The former infers the disparity map from various disparity cues, such as texture, defocus, and vanishing points [102], [103], [104]. The latter finds pairs of correspondences to compute disparity maps. This dissertation focuses on the stereo correspondence approach.

2.1.1 Epipolar Geometry

The disparity estimation for multi-view videos can be constrained by the epipolar geometry to reduce the correspondence search range from 2-D space to 1-D space. Figure II-1 shows the concept of epipolar geometry with a two-view configuration, in which the object Pb is watched by the target [...] correspondence candidates with p would be located on the ray from C to Pb, whose projected line in the reference image plane is called the epipolar line. In other words, the correspondence of p can be searched on the epipolar line, and the search range is restricted to 1-D space.

Furthermore, the image planes can be rectified and translated into new positions with parallel epipolar lines, as shown in Figure II-2. The correspondence search range then lies on a horizontal line instead of an oblique line in the original image plane. In other words, a pair of correspondences has the identical y-coordinate in the two views. Thus, the computation of disparity estimation can proceed regularly in raster-scan order.

Figure II-1 Epipolar geometry

Figure II-2 Image planes with rectification

With the rectified image planes, Figure II-3 shows the relation between depth and disparity for a pair of correspondences. The two cameras at the viewpoints C and C' capture the object point Pb and project it to a pair of correspondences on the epipolar line. The correspondences are located at the coordinates X and -X' relative to their camera centers. Given the focal length f and the baseline B of the cameras, if we can estimate the disparity X - X', the object depth Z can be acquired by

$$Z = \frac{f \times B}{X - X'}. \qquad \text{(II-1)}$$

Therefore, disparity estimation finds the pair of correspondences and uses their x-coordinates to compute the disparity value or depth value for each pixel.
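Equation (II-1) can be checked numerically with a minimal Python sketch. The function name `depth_from_disparity` and the camera values (f = 1000 pixels, B = 0.1 m) are hypothetical and chosen only for illustration; they are not taken from the dissertation.

```python
def depth_from_disparity(disparity, focal_length, baseline):
    """Return the depth Z = f * B / (X - X') of eq. (II-1).

    disparity    : X - X' in pixels (must be positive)
    focal_length : f in pixels
    baseline     : B in the unit wanted for Z (for example, metres)
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length * baseline / disparity


# Hypothetical rectified camera pair: f = 1000 px, B = 0.1 m.
for d in (10, 20, 40):
    print(d, depth_from_disparity(d, 1000.0, 0.1))
```

Doubling the disparity halves the depth, which is why near objects need the large end of the disparity range.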

Figure II-3 Relation between disparity and depth for a pair of correspondences

2.1.2 General Algorithm Flow

For disparity estimation algorithms, a general framework was proposed by Scharstein and Szeliski [105], as shown in Figure II-4. In this framework, two images are captured and rectified as inputs, and a disparity map is the target result. Under this framework, disparity estimation algorithms can be classified into two categories: the local approach and the global approach [105], [106]. The local approach consists only of the matching cost calculation and the cost aggregation, while the global approach additionally performs the optimization process. The last step, disparity refinement, is an optional process for computing fractional disparity and other post-processing. The existing approaches for each step are reviewed as follows.

Figure II-4 A general framework for disparity estimation algorithms

1. Matching Cost Calculation

Matching cost is a quantitative dissimilarity measure used to find the best pair of correspondences. Figure II-5 shows the concept of the matching cost calculation: a target pixel has multiple reference pixels as correspondence candidates, and each correspondence candidate has a matching cost. The number of correspondence candidates is equal to the disparity range DR, which is determined by the nearest and farthest objects in the scene. Hence, each target pixel has DR matching costs. To determine a whole disparity map, the matching costs of all target pixels are calculated and form a disparity space image (DSI), which is called the cost cube in this dissertation. As shown in Figure II-6, a cost cube contains the spatial dimensions x and y and the disparity dimension d. The size of the cube for a whole frame is H×W×DR, where H and W are the frame height and width. The initial values of the cost cube are computed by the matching cost calculation.
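As an illustration of the cost cube, the following Python sketch fills C0 with the absolute difference metric of Table II-1. It is a sketch under stated assumptions, not the implementation used in this dissertation: the function name `ad_cost_cube`, the nested-list layout `cube[y][x][d]`, and the infinite cost assigned to candidates outside the reference frame are all choices made here for illustration.

```python
def ad_cost_cube(target, reference, disparity_range):
    """Build C0[y][x][d] = |I_tar(x, y) - I_ref(x - d, y)| for rectified views.

    target, reference : lists of rows of grey-level intensities (same size)
    disparity_range   : DR, the number of correspondence candidates per pixel
    Candidates that fall outside the reference frame keep an infinite cost
    (an assumption of this sketch, not a policy stated in the text).
    """
    height, width = len(target), len(target[0])
    cube = [[[float("inf")] * disparity_range for _ in range(width)]
            for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for d in range(disparity_range):
                if x - d >= 0:
                    cube[y][x][d] = abs(target[y][x] - reference[y][x - d])
    return cube


# Toy 1x5 frames: the target row is the reference row shifted right by 2 pixels.
reference = [[10, 50, 90, 30, 70]]
target = [[0, 0, 10, 50, 90]]
cube = ad_cost_cube(target, reference, disparity_range=3)
print(cube[0][4])  # DR matching costs of the target pixel at x = 4

# Number of cost entries for one HD1080p frame with DR = 128.
print(1080 * 1920 * 128)
```

In the toy example the cost of the pixel at x = 4 is zero at d = 2, the true shift. The last line shows why the cube is expensive for high definition video: one HD1080p frame with DR = 128 needs 265,420,800 cost entries.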

[Figure II-4 block labels: Target View and Reference View (inputs); Matching Cost Calculation; Cost Aggregation; Disparity Selection/Optimization; Disparity Refinement; Target-View Disparity Map (output)]

Figure II-5 Matching costs of a target pixel and its correspondence candidates

Figure II-6 Illustration of a cost cube

To compute the initial cost cube C0, one of various match metrics [105]-[3] can be adopted. Table II-1 lists the commonly used match metrics, which can be classified as pixel-based or block-based. For the pixel-based match metrics, the absolute difference (AD) and the square difference (SD) are computed using a target pixel and a reference pixel. The pixel dissimilarity measure (PDM) additionally considers the half pixels to lessen the sampling sensitivity [1].

On the other hand, a block-based match metric is computed using a target block and a reference block with support pixels, as illustrated in Figure II-7. In Table II-1, the normalized cross correlation (NCC) is a statistical method that uses the block mean and variance to reduce the sensitivity to radiometric gain and bias. The Rank metric transforms the pixel color into a rank value, which is the relative order of the center pixel in the block, and computes the matching cost from the rank difference. The Census metric transforms the pixel intensity into a census bit stream, which consists of the intensity comparison results between the center pixel and the support pixels. The matching cost of two census bit streams is computed by the Hamming distance. Because Rank and Census transform the original pixels from the color domain to different domains, they better resist the radiometric distortion between views.
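The resistance of the Census metric to radiometric distortion can be illustrated with a short Python sketch. The function names, the bit ordering, and the border handling (coordinates clamped to the frame) are assumptions made for this example; the text above does not specify them.

```python
def census(image, x, y, r=1):
    """Census bit stream of pixel (x, y) packed into an integer.

    One bit per support pixel in the (2r+1)x(2r+1) block, set when the
    centre pixel is brighter than that support pixel. Coordinates outside
    the frame are clamped to the border (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    center = image[y][x]
    bits = 0
    for v in range(y - r, y + r + 1):
        for u in range(x - r, x + r + 1):
            if u == x and v == y:
                continue
            vv = min(max(v, 0), height - 1)
            uu = min(max(u, 0), width - 1)
            bits = (bits << 1) | int(center > image[vv][uu])
    return bits


def hamming(a, b):
    """Number of differing bits between two census bit streams."""
    return bin(a ^ b).count("1")


def census_cost(target, reference, x, y, d, r=1):
    """Matching cost between I'_tar(x, y) and I'_ref(x - d, y)."""
    if x - d < 0:
        raise ValueError("candidate lies outside the reference frame")
    return hamming(census(target, x, y, r), census(reference, x - d, y, r))


reference = [[1, 5, 3, 8, 2],
             [7, 2, 9, 4, 6],
             [3, 8, 1, 5, 7]]
# Hypothetical target view: shifted right by 1 pixel, with gain 2 and bias 5.
target = [[2 * row[max(x - 1, 0)] + 5 for x in range(len(row))]
          for row in reference]

print(census_cost(target, reference, x=3, y=1, d=1))  # true disparity
print(census_cost(target, reference, x=3, y=1, d=0))  # wrong candidate
print(abs(target[1][3] - reference[1][2]))            # AD at the true disparity
```

At the true disparity d = 1 the census cost is zero even though the gain and bias make the absolute difference nonzero, because the transform depends only on the intensity ordering inside the block.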

To sum up, the initial cost cube C0 is computed in this matching cost calculation step, and the computational complexity of this step is O(H×W×DR).

Figure II-7 Block-based matching cost with the block radius r

Table II-1 Various match metrics for computing C0(x, y, d)

Pixel-based metrics

Absolute Difference (AD): $|I_{tar}(x, y) - I_{ref}(x - d, y)|$

Square Difference (SD): $[I_{tar}(x, y) - I_{ref}(x - d, y)]^2$

Pixel Dissimilarity Measure (PDM): $\min\{|I_{tar}(x, y) - I_{ref}(x - d, y)|,\ |I_{tar}(x, y) - I_{ref}^{+}|,\ |I_{tar}(x, y) - I_{ref}^{-}|\}$, where $I_{ref}^{+}$ and $I_{ref}^{-}$ are the neighboring half pixels of $I_{ref}(x - d, y)$

Block-based metrics

Normalized Cross Correlation (NCC): $\frac{\sum_{|x-u| \le r,\ |y-v| \le r} [I_{tar}(u, v) - \bar{I}_{tar}][I_{ref}(u - d, v) - \bar{I}_{ref}]}{\sqrt{\sum_{|x-u| \le r,\ |y-v| \le r} [I_{tar}(u, v) - \bar{I}_{tar}]^2 [I_{ref}(u - d, v) - \bar{I}_{ref}]^2}}$

Rank: $|I'_{tar}(x, y) - I'_{ref}(x - d, y)|$, where $I'(m, n) = \sum_{|m-u| \le r,\ |n-v| \le r} (I(m, n) > I(u, v))$

Census: $\mathrm{Hamming}(I'_{tar}(x, y),\ I'_{ref}(x - d, y))$, where $I'(m, n) = \mathrm{bitstream}_{|m-u| \le r,\ |n-v| \le r}(I(m, n) > I(u, v))$
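As an illustration of the metrics in Table II-1, the following minimal Python sketch computes the AD cost and the Census cost for one pair of pixels. The function names, the toy frames, and the choice to exclude the center pixel from the census bit stream are assumptions made for this example only.

```python
def ad_cost(tar, ref, x, y, d):
    """Absolute difference between target pixel (x, y) and reference pixel (x - d, y)."""
    return abs(tar[y][x] - ref[y][x - d])


def census(img, x, y, r):
    """Census bit stream: compare the center pixel with every support pixel in the block."""
    bits = []
    for v in range(y - r, y + r + 1):
        for u in range(x - r, x + r + 1):
            if (u, v) != (x, y):  # the center pixel is skipped (an assumption)
                bits.append(1 if img[y][x] > img[v][u] else 0)
    return bits


def census_cost(tar, ref, x, y, d, r=1):
    """Hamming distance between the census bit streams of the two blocks."""
    a = census(tar, x, y, r)
    b = census(ref, x - d, y, r)
    return sum(p != q for p, q in zip(a, b))


# Toy 3x4 frames: the reference view is the target view shifted by one pixel,
# with a radiometric gain of 2 and a bias of 5 applied.
tar = [[10, 20, 30, 40],
       [50, 90, 70, 80],
       [15, 25, 35, 45]]
ref = [[2 * row[min(x + 1, 3)] + 5 for x in range(4)] for row in tar]

print(ad_cost(tar, ref, 2, 1, 1))      # AD is disturbed by the gain and bias
print(census_cost(tar, ref, 2, 1, 1))  # Census only depends on the intensity order
```

At the correct disparity d = 1, the AD cost stays large because of the gain and bias, while the Census cost is zero because the order of intensities inside the block is preserved.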

2. Cost Aggregation

The main idea of the cost aggregation step is to gather the costs of the neighboring pixels in a window onto the center pixel. It assumes that the neighboring pixels have the same disparity as the center pixel, so that gathering the costs of the neighbors increases the reliability of the matching cost. Thus, the cost aggregation step accumulates the neighboring costs for the center pixel by the general equation,


$C_{aggr}(x, y, d) = \frac{\sum_{(u,v) \in win(x,y)} C_0(u, v, d) \cdot W_{aggr}(u, v)}{\sum_{(u,v) \in win(x,y)} W_{aggr}(u, v)}$, (II-2)

where C0 is the initial cost cube, and Caggr is the aggregated cost cube. In this equation, each initial cost C0(u, v, d) in an aggregation window with radius r is accumulated with the weight Waggr(u, v) into the target cost Caggr(x, y, d). In addition, the accumulated value is normalized by the sum of weights. The computational complexity of this step is O(H×W×DR×r²), which is proportional to the area of the aggregation window.
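The normalized weighted sum of Eq. (II-2) can be sketched in Python for a single disparity slice as follows. The function names and the border policy (support pixels outside the frame are skipped) are assumptions of this sketch; the weight function is passed in so that different weight distributions can be plugged in.

```python
def aggregate(c0, weight, r):
    """Apply Eq. (II-2) to one disparity slice c0[y][x] with a window of radius r.

    weight(x, y, u, v) returns W_aggr(u, v) for the window centered at (x, y).
    Support pixels outside the frame are skipped (an assumed border policy).
    """
    h, w = len(c0), len(c0[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for v in range(max(0, y - r), min(h, y + r + 1)):
                for u in range(max(0, x - r), min(w, x + r + 1)):
                    wt = weight(x, y, u, v)
                    num += c0[v][u] * wt   # weighted accumulation
                    den += wt              # sum of weights for normalization
            out[y][x] = num / den
    return out


def uniform(x, y, u, v):
    """Uniform weight: every support pixel contributes equally."""
    return 1.0


slice_d = [[0, 0, 0],
           [0, 9, 0],
           [0, 0, 0]]
smoothed = aggregate(slice_d, uniform, r=1)
```

The four nested loops make the O(H×W×r²) cost per disparity level directly visible; repeating the call for every disparity level gives the DR factor.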

Figure II-8 shows various existing cost aggregation approaches with different weight distributions. In Figure II-8 (a), the uniform weight approach assigns a constant weight to every support pixel and uses a fixed r. Its disparity map would be over-blurred for thin objects if r is too large, while it would be incorrect for textureless regions if r is too small. Therefore, for better disparity quality, the radius of the uniform weight needs to be adaptively adjusted according to the image content, as shown in Figure II-8 (b). Another commonly used approach is the Gaussian weight, which gives the pixels near the window center higher weights, as shown in Figure II-8 (c). However, these three approaches cannot obtain accurate disparities due to their fixed window shape (i.e., square or circle).

To control the window shape, the adaptive polygon weight approach [4], [5] uses an 8-direction or 4-direction configuration to fit the object shape as shown in Figure II-8 (d). The cross-based weight approach [6], in turn, uses multiple cross lines to fit the object shape as shown in Figure II-8 (e). In both approaches, a support region grows from the window center until its boundary touches a dissimilar pixel. However, the two approaches do not perform well for highly textured regions because their support regions are continuous.

The adaptive support-weight (ADSW) approach [7] avoids this problem, because all support pixels are considered and their weights are determined by the kernels of a bilateral filter. Its weight is defined as


where Wtar is the weight from target-view window, and Wref is the weight from reference-view window.

Both the weights Wtar and Wref are computed by the kernels of bilateral filter,

$W(u, v) = f(\|(x, y) - (u, v)\|) \cdot g(\|I(x, y) - I(u, v)\|)$, (II-4)

where f is the spatial kernel with the position distance, and g is the range kernel with the color distance.

With the two kernels, the aggregation weight is large when the support pixel is close to the center pixel and similar to it in color. Figure II-8 (f) illustrates the adaptive support-weight, in which the aggregation weight fits the object shape better than the adaptive polygon weight and cross-based weight approaches for highly textured regions. However, the main disadvantage of the ADSW approach is its high computational complexity. Nevertheless, this can be addressed by the integral histogram approach [8], the iterative aggregation with small window approach [9], and the data reuse approach in VLSI design [10].

In summary, the cost aggregation step processes the initial cost cube C0 into a more reliable cost cube Caggr by the well-defined weights.
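To make the bilateral-filter weight of Eq. (II-4) concrete, the sketch below evaluates it on a small grayscale image. The exponential form of both kernels and the parameters gamma_s and gamma_c are assumptions chosen for illustration; Eq. (II-4) itself does not fix the kernel shapes.

```python
import math


def support_weight(img, x, y, u, v, gamma_s=5.0, gamma_c=10.0):
    """W(u, v) of Eq. (II-4) for the window centered at (x, y) on a grayscale image."""
    spatial = math.hypot(x - u, y - v)   # ||(x, y) - (u, v)||
    color = abs(img[y][x] - img[v][u])   # ||I(x, y) - I(u, v)||
    f = math.exp(-spatial / gamma_s)     # spatial kernel (assumed form)
    g = math.exp(-color / gamma_c)       # range kernel (assumed form)
    return f * g


# A vertical edge: the right column belongs to a different object.
img = [[100, 100, 20],
       [100, 100, 20],
       [100, 100, 20]]
w_same = support_weight(img, 1, 1, 0, 1)  # left neighbor, same color
w_edge = support_weight(img, 1, 1, 2, 1)  # right neighbor, across the edge
```

Both neighbors are at the same distance from the center, so the difference between w_same and w_edge comes only from the range kernel, which suppresses the support pixel across the edge.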


Figure II-8 Various cost aggregation approaches

(a) uniform weight, (b) uniform weight with adaptive window radius, (c) Gaussian weight, (d) adaptive polygon weight, (e) cross-based weight, (f) adaptive support-weight.

3. Disparity Selection/Optimization

With the aggregated cost cube Caggr, one of two alternative methods can be applied to compute the disparity map. One is the winner-take-all (WTA) method, which directly determines the disparity result by selecting, for each target pixel, the reference pixel with the minimum cost as the best correspondence. The other is the disparity optimization method, which considers the aggregated costs of the whole frame and computes the disparity map by energy minimization. The latter acquires more accurate disparity maps, as shown in the evaluation results [72].
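The WTA selection amounts to an independent arg-min over the disparity axis for every pixel, as the following minimal sketch shows. The function name and the cost_cube[y][x][d] layout are assumptions of this example.

```python
def winner_take_all(cost_cube):
    """cost_cube[y][x][d] -> disparity map holding the minimum-cost d of each pixel."""
    return [[min(range(len(costs)), key=costs.__getitem__) for costs in row]
            for row in cost_cube]


# A 2x2 frame with a disparity range of 3.
cube = [[[5, 2, 9], [1, 4, 4]],
        [[7, 7, 3], [6, 0, 8]]]
disparity = winner_take_all(cube)
```

Because each pixel is decided on its own, no smoothness between neighboring pixels is enforced, which is the gap that the optimization methods address.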

The commonly used disparity optimization approaches are dynamic programming (DP), graph-cut (GC), and belief propagation (BP). Their main concept is to convert the disparity estimation problem into an energy minimization problem. The energy function is generally formulated as

$E(\mathbf{d}) = E_{data}(\mathbf{d}) + \lambda E_{smooth}(\mathbf{d})$, (II-5)

where Edata is the data term that penalizes the dissimilarity of a correspondence pair, Esmooth is the smoothness term that penalizes the disparity inconsistency of two neighboring pixels, and λ is the weight of the smoothness term. In addition, d is a selected disparity set for the whole frame. The optimization approaches attempt to find the disparity set d that minimizes the total energy E.

The concepts of the commonly used optimization approaches are reviewed as follows.

(1) Dynamic Programming

The main idea of the DP approach is to convert the disparity estimation into a shortest-path problem. The optimization process is performed row by row. Figure II-9 (a) shows the graph model for the shortest-path problem, in which the position of a node corresponds to a coordinate in the x-d plane, and the shortest path runs from x = 0 to x = W-1. The path incurs a matching penalty on each node and a smoothness penalty on each edge. The DP approach finds the path with the minimum penalty in two steps: forward accumulating and backward tracing. In Figure II-9 (b), the first step accumulates the penalty in the forward direction to select the moving direction for each node. In Figure II-9 (c), with the moving direction map, the second step traces the path with the minimum penalty in the backward direction.

However, the DP approach suffers from streak artifacts in the disparity map because of its row-by-row processing. To address this problem, Ohta and Kanade [11] perform the DP in a 3-D space that consists of the original intra-scanline space and an additional inter-scanline space. In addition, the tree-based DP algorithms [12]-[14] use a tree structure to connect scanlines and remove the streak artifacts.


Figure II-9 Concept of dynamic programming approach

(a) graph model in DP approach, (b) forward accumulating, (c) backward tracing
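The forward accumulating and backward tracing steps of the DP approach can be sketched for one scanline as follows. This is a minimal illustration, not one of the reviewed algorithms: the function name dp_scanline and the linear smoothness penalty smooth_weight × |d − d_prev| are assumptions.

```python
def dp_scanline(costs, smooth_weight=1.0):
    """costs[x][d] is one x-d slice of the cost cube; returns one disparity per x.

    The smoothness penalty on an edge is assumed to be
    smooth_weight * |d - d_prev|.
    """
    width, dr = len(costs), len(costs[0])
    acc = [list(costs[0])]   # accumulated penalty of each node
    move = [[0] * dr]        # moving direction map (best previous d)
    # Forward accumulating: from x = 0 to x = W-1.
    for x in range(1, width):
        acc_row, move_row = [], []
        for d in range(dr):
            best_prev = min(range(dr),
                            key=lambda p: acc[x - 1][p] + smooth_weight * abs(d - p))
            acc_row.append(costs[x][d] + acc[x - 1][best_prev]
                           + smooth_weight * abs(d - best_prev))
            move_row.append(best_prev)
        acc.append(acc_row)
        move.append(move_row)
    # Backward tracing: start from the cheapest end node and follow the map.
    d = min(range(dr), key=acc[-1].__getitem__)
    path = [d]
    for x in range(width - 1, 0, -1):
        d = move[x][d]
        path.append(d)
    return path[::-1]


# The third pixel has a slightly lower cost at the wrong disparity d = 1.
costs = [[0, 3], [0, 3], [2, 1], [0, 3]]
smooth_path = dp_scanline(costs, smooth_weight=2.0)
wta_like_path = dp_scanline(costs, smooth_weight=0.0)
```

With the smoothness penalty enabled, the isolated outlier at the third pixel is suppressed; with a zero penalty, the result degenerates to the per-pixel minimum.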

(2) Graph-Cut

The main idea of the GC approach is to convert the disparity selection problem into a min-cut/max-flow problem [15], so that the associated optimization techniques can be adopted. The GC approach can generate accurate disparity maps.

Figure II-10 shows the graph model of min-cut/max-flow for disparity estimation, in which there are H×W×DR nodes in a 6-connected grid. The matching cost and the smoothness cost are defined on the edges, which can be regarded as pipes with different flow volumes due to


different costs. In this graph model, water from the source node flows to the sink node through the pipes. The min-cut is the cut surface across the edges with the minimum total flow, while the max-flow is the maximum flow allowed from the source to the sink. The min-cut and the max-flow are equivalent problems. For disparity estimation, the disparity map can be directly determined from the resultant cut surface.

Figure II-10 Graph model of graph-cut algorithm

For the min-cut/max-flow problem, the commonly used optimization techniques are push-relabel [16] and augmenting path [17]. Their computational complexity is highly related to the number of label candidates (i.e., the disparity range DR in disparity estimation). Consequently, these optimization techniques suffer from extremely high computational complexity when the disparity range is large.

To reduce the computational complexity, Boykov proposed the swap method [18] and an efficient augmenting path method [19]. The swap method performs the optimization process disparity by disparity, and only one new disparity is considered in each iteration. Based on the swap method, Chou et al. [20] proposed a fast algorithm that predicts the disparities to skip part of the optimization process. On the other hand, for the push-relabel approach, the computational speed depends on the processing order of the nodes. Thus, Cherkassky and Goldberg [21] proposed the highest-label order, which is more efficient than the typical FIFO order. In addition, Delong and Boykov [22] proposed a block-based graph cut method to increase the parallelism of the push-relabel approach.


To sum up, the GC approach can produce accurate disparity results but is not suitable for acceleration by GPU programming and VLSI design due to its irregular computation and low parallelism.

(3) Belief Propagation

Sun et al. [24] first applied the BP approach to the disparity estimation problem, and acquired accurate disparity maps. They perform the energy minimization on the graph model shown in Figure II-11, in which each node corresponds to a pixel, and all nodes are connected in a 4-connection grid. In the optimization process, the matching costs of each node are diffused through messages to the neighboring nodes iteration by iteration. This diffusion mechanism is called message passing. After several iterations, the matching costs and messages of a node are aggregated to determine the disparity result. Although the minimized energy is not guaranteed to converge due to the loopy optimization process, the disparity maps can approach a steady state.

Figure II-11 Graph model of belief propagation approach

In the BP approach, the message passing has the highest computational complexity, O(H×W×DR²×T), where T is the iteration count. The DR² term results from the convolution, and the iteration count T should be more than 10. To reduce the computation of message passing, Felzenszwalb and Huttenlocher [25] proposed the hierarchical BP (HBP) and the linear-time message passing. The former accelerates the convergence of the disparities, and the latter reduces the complexity of the convolution from O(DR²) to O(DR). In addition, Szeliski et al. [26] proposed the max-product loopy belief propagation, called BP-M, to reduce the iteration count by a scale. Because the computation of the BP approach is highly parallel, the BP approach is suitable for acceleration by GPU programming and VLSI design [27]-[33].
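The reduction from O(DR²) to O(DR) can be illustrated on a single message. The sketch below assumes a min-sum formulation with a linear smoothness cost s × |d − d'| and no truncation; under that assumption, two sweeps over the disparity axis give the same message as the exhaustive search. The function names are hypothetical.

```python
def message_bruteforce(h, s):
    """O(DR^2) message: m(d) = min over d' of h(d') + s * |d - d'|."""
    n = len(h)
    return [min(h[q] + s * abs(d - q) for q in range(n)) for d in range(n)]


def message_linear(h, s):
    """O(DR) message for the same linear smoothness model, using two sweeps."""
    m = list(h)
    for d in range(1, len(m)):            # forward sweep
        m[d] = min(m[d], m[d - 1] + s)
    for d in range(len(m) - 2, -1, -1):   # backward sweep
        m[d] = min(m[d], m[d + 1] + s)
    return m


# h(d): matching cost plus incoming messages at one node, for DR = 6.
h = [7, 1, 6, 9, 2, 8]
slow = message_bruteforce(h, 2)
fast = message_linear(h, 2)
```

Each message entry depends only on h and on its two neighbors along the disparity axis, which is also why the update maps well onto parallel hardware.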

In addition, the BP approach also suffers from a high memory cost, 4HW×DR, for the matching costs and messages of the whole frame. To address it, the bipartite grid [25] and the sliding approach [34] are proposed for the memory access, and the predictive coding scheme [35] can be applied for message compression.

To sum up, the above disparity optimization algorithms have different pros and cons. The DP approach can achieve real-time speed more easily but has streak artifacts, and its improved methods result in additional irregular computation. For the 2-D optimization approaches, the GC approach delivers high-quality disparity maps, but its irregular computation limits the acceleration by GPU programming and VLSI design. On the other hand, the BP approach can also deliver accurate disparity maps and has high parallelism. Therefore, this dissertation develops an efficient disparity estimation algorithm based on the BP approach.

4. Disparity Refinement

The final step refines the disparity maps by the post-processing methods: occlusion handling,

object consistency enhancement, and temporal consistency enhancement. Their purpose and associated

algorithms are reviewed as follows.

(1) Occlusion Handling

The occlusion problem arises when an object point is visible in one view but invisible in the other view. Thus, there is no corresponding pixel in the view where the point is invisible. Incorrect disparities would appear in the occlusion regions, and further induce artifacts in the view synthesis.

To handle the occlusion problem, the general approach is to detect the occlusion first, and then fill it with the background disparities. These two steps are called occlusion detection and occlusion filling. The basic methods for occlusion detection are surveyed in [45]. Various methods have different


disparity, and the occlusion constraint (OCC) assumes that a disparity gap between two pixels results in an occlusion region in the other view. In addition, the order constraint (ORD) assumes that the correspondences of two pixels keep the same order in the other view. Among the above occlusion detection methods, the LRC is the most commonly applied in the disparity refinement [6], [40], while the OCC and the ORD are combined into the disparity optimization step [15], [24]. With the detected occlusion pixels, the occlusion filling step can directly replace them with the reliable background disparities.
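The two steps of occlusion detection and occlusion filling can be sketched on one scanline as below. The sketch assumes that the LRC is the usual left-right consistency check between the disparity maps of the two views, and that the background is the neighbor with the smaller disparity; the function names and the tolerance parameter are hypothetical.

```python
def lrc_occlusion(d_tar, d_ref, tol=0):
    """Mark pixel x of one scanline as occluded when the two views disagree.

    d_tar[x] maps target pixel x to reference pixel x - d_tar[x], following the
    (x - d, y) convention of Table II-1.
    """
    occluded = []
    for x, d in enumerate(d_tar):
        xr = x - d
        ok = 0 <= xr < len(d_ref) and abs(d - d_ref[xr]) <= tol
        occluded.append(not ok)
    return occluded


def fill_background(d_tar, occluded):
    """Replace each occluded disparity by the smaller (farther) valid neighbor."""
    filled = list(d_tar)
    n = len(d_tar)
    for x in range(n):
        if occluded[x]:
            left = next((d_tar[i] for i in range(x - 1, -1, -1) if not occluded[i]), None)
            right = next((d_tar[i] for i in range(x + 1, n) if not occluded[i]), None)
            candidates = [c for c in (left, right) if c is not None]
            if candidates:
                filled[x] = min(candidates)
    return filled


# A foreground object (d = 2) at target pixels 4-5; pixels 2-3 are occluded in
# the reference view and carry wrong disparities.
d_tar = [0, 0, 1, 1, 2, 2, 0, 0]
d_ref = [0, 0, 2, 2, 0, 0, 0, 0]
occluded = lrc_occlusion(d_tar, d_ref)
filled = fill_background(d_tar, occluded)
```

Choosing the smaller of the two neighboring disparities reflects that the occluded region belongs to the background rather than to the occluding object.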

(2) Object Consistency Enhancement

For an object, the disparities are usually identical or smoothly changing. However, disparity maps often suffer from incorrect disparities, especially in textureless regions. To remove the disparity noise, the plane fitting approach [46] is usually adopted by the high-performance disparity estimation algorithms [63], [39], [40]. In the plane fitting approach, the segment information is first computed by watershed segmentation, mean-shift clustering, or K-means clustering. According to the segment information, the disparities in a segment are used to compute a new 3-D plane by the linear regression method. Besides the plane fitting method, the regional voting method [6] can also refine the disparity maps well. The regional voting method is simpler than the plane fitting method because the segment information is not required.
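As a rough illustration of the plane fitting step, the sketch below fits the plane d = a·x + b·y + c to the disparities of one segment by ordinary least squares. It assumes that the segment is already given as a list of (x, y, d) samples, and it solves the 3×3 normal equations with Cramer's rule; the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(samples):
    """Least-squares fit of d = a*x + b*y + c to the (x, y, d) samples of a segment."""
    sxx = sum(x * x for x, y, d in samples)
    sxy = sum(x * y for x, y, d in samples)
    syy = sum(y * y for x, y, d in samples)
    sx = sum(x for x, y, d in samples)
    sy = sum(y for x, y, d in samples)
    sxd = sum(x * d for x, y, d in samples)
    syd = sum(y * d for x, y, d in samples)
    sd = sum(d for x, y, d in samples)
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(samples))]]
    rhs = [sxd, syd, sd]
    det = det3(normal)
    coeffs = []
    for k in range(3):                 # Cramer's rule, one column at a time
        m = [row[:] for row in normal]
        for i in range(3):
            m[i][k] = rhs[i]
        coeffs.append(det3(m) / det)
    return tuple(coeffs)


# A 3x3 segment on the plane d = x + 2, with a noisy disparity at its center.
samples = [(x, y, x + 2.0) for y in range(3) for x in range(3)]
samples[4] = (1, 1, 6.0)
a, b, c = fit_plane(samples)
refined_center = a * 1 + b * 1 + c
```

The refined center disparity lies much closer to the underlying plane than the noisy input value, which is the effect the plane fitting approach relies on to remove disparity noise inside a segment.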

(3) Temporal Consistency Enhancement

Most research develops disparity estimation algorithms using still image sequences [72]. However, this misses the temporal consistency issue, which is important in the view synthesis application for video sequences. Without enhancing the temporal consistency, the disparity maps suffer from flicker artifacts, because each disparity frame is independently generated, and the disparities are unstable in the occlusion and textureless regions. These flicker artifacts further propagate to the view synthesis results, and are easily observed.

To address the temporal consistency, the neighboring frames should be considered in the disparity


flow with the spatial and temporal dimensions, and different smoothing approaches are performed in the disparity flow. On the other hand, with two adjacent frames, the temporal BP algorithm [41] performs the BP optimization in a 6-connection grid graph, where the two additional connections link to the previous and next frames. In addition, the 3DVC’s DERS algorithm [65]-[67] adds a temporal cost to the matching cost according to the previous disparity.

In summary, the disparity refinement step could fix the inconsistent disparities well, and improve

the view synthesis quality for 3DTV applications.

2.2 View Synthesis

In 3DTV applications, view synthesis is one of the most important components; it synthesizes one or multiple virtual-view videos for the stereoscopic TV or the free-viewpoint TV [101]. A common approach for view synthesis is the depth-image-based rendering (DIBR) algorithm [51]-[57], which warps a video to another view according to disparity maps.

Figure II-12 General flow of view synthesis

A general DIBR algorithm can be divided into three steps: warping, blending, and hole filling, as depicted in Figure II-12. Depending on the number of input views, the DIBR algorithm faces different challenges in its steps. With single-view input, the DIBR algorithm suffers from large


occlusion holes in the hole filling step, while with multiple-view inputs, it suffers from inconsistent

warped pixels in the blending step. The concept and challenges of each step are presented in the

following.

2.2.1 Warping

In Figure II-12, the warping step loads the textures and disparities of the reference side-views to generate the warped textures and hole maps of the target center-view. In the warping step, the reference textures are shifted to the target view according to the reference disparity maps.

The methods of warping step can be classified into the one-step warping and the two-step

warping as illustrated in Figure II-13. The one-step warping directly warps the reference textures to

the target view according to the warping position of disparities, while the two-step warping first warps

the target disparity and then uses it to synthesize the target texture. Rogmans et al. [58] and Morvan

[59] show that the two-step warping could perform better because its sampling precision is higher.


Figure II-13 Warping methods in view synthesis (a) one-step warping, (b) two-step warping
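A minimal sketch of the one-step warping on a single scanline is given below. The parameters alpha (the fraction of the baseline between the reference view and the target view) and direction (the sign of the shift) are assumptions of this example, as is the rule that the pixel with the larger disparity wins when two pixels land on the same target position.

```python
def warp_scanline(texture, disparity, alpha=0.5, direction=-1):
    """One-step warping of a reference scanline toward a virtual view.

    Each reference pixel x moves to x + direction * round(alpha * d).
    When two pixels collide, the one with the larger disparity (nearer to the
    camera) wins. Returns (warped, hole), where hole[x] is True if no
    reference pixel landed on x. Disparities are assumed non-negative.
    """
    n = len(texture)
    warped = [None] * n
    best_d = [-1] * n
    for x in range(n):
        d = disparity[x]
        xt = x + direction * round(alpha * d)
        if 0 <= xt < n and d > best_d[xt]:
            warped[xt] = texture[x]
            best_d[xt] = d
    hole = [p is None for p in warped]
    return warped, hole


# Foreground pixels E and F (d = 4) in front of a background (d = 0).
texture = ['a', 'b', 'c', 'd', 'E', 'F', 'g', 'h']
disparity = [0, 0, 0, 0, 4, 4, 0, 0]
warped, hole = warp_scanline(texture, disparity)
```

The foreground pixels shift by two positions and cover part of the background, while the positions they leave behind become holes that are recorded in the hole map for the later steps.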

2.2.2 Blending


With multi-view inputs, the warping step generates multiple textures for the target view as shown in Figure II-12. In other words, there are multiple warped pixels for a target position. However, the colors of these warped pixels are not consistent due to different radiometric gain and bias at different viewpoints. Therefore, according to the hole maps, the warped pixels should be blended by different methods for three cases: visible pixel, occluded pixel, and disoccluded pixel. A visible pixel is labeled “non-hole” in more than one hole map, and can be seen from multiple viewpoints. Thus, its color can be computed by averaging the warped pixels. An occluded pixel is labeled “non-hole” in one hole map only, and can be seen from only one viewpoint. Thus, its color is taken from the only warped pixel. Finally, a disoccluded pixel is labeled “hole” in all hole maps, and cannot be seen from any viewpoint. Thus, it should be handled in the next step. In addition, the hole regions can be dilated before blending to avoid the ghost artifact as shown in Figure II-14.


Figure II-14 Blending step in view synthesis (a) without hole dilation, (b) with hole dilation
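The three blending cases can be sketched for two warped scanlines as follows. The function name and the use of a plain average for visible pixels are taken as a simple reading of the description above; hole dilation is not included in this sketch.

```python
def blend(warped_l, hole_l, warped_r, hole_r):
    """Blend two warped scanlines according to their hole maps.

    Returns (blended, hole): visible pixels are averaged, pixels seen in one
    view copy that view, and pixels that are holes in both views stay holes.
    """
    blended, hole = [], []
    for pl, hl, pr, hr in zip(warped_l, hole_l, warped_r, hole_r):
        if not hl and not hr:    # visible pixel: seen in both views
            blended.append((pl + pr) / 2)
            hole.append(False)
        elif not hl:             # occluded pixel: seen only in the left view
            blended.append(pl)
            hole.append(False)
        elif not hr:             # occluded pixel: seen only in the right view
            blended.append(pr)
            hole.append(False)
        else:                    # disoccluded pixel: left for hole filling
            blended.append(None)
            hole.append(True)
    return blended, hole


v, h = blend([100, 110, None, None], [False, False, True, True],
             [104, None, 90, None], [False, True, False, True])
```

Only the last position remains a hole after blending, so the hole filling step has to handle far fewer pixels than with a single-view input.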

2.2.3 Hole Filling

With multiple-view inputs, most holes can be easily recovered from the other views. The remaining disoccluded holes can be filled by the advanced in-painting method [60]. On the other hand, with


The occluded holes can be handled by the disparity smoothing methods [52]-[55] to reduce hole sizes,

and be filled by the interpolation method [53].

In summary, the 3DTV applications demand a view synthesis engine to generate virtual view

videos, and the DIBR algorithm could satisfy this requirement through the above steps. However, the

quality of view synthesis is highly dependent on the performance of disparity estimation. Therefore, it

is necessary to develop a high-performance disparity estimation algorithm for the 3-D video

production.

2.3 Review of DERS Algorithm from 3DVC

The 3D Video Coding (3DVC) team is organized in the Moving Picture Experts Group (MPEG) to support the associated techniques for 3DTV applications, including disparity estimation, view synthesis, and multi-view video coding. The 3DVC team defines the configuration of input and output views for the 3DTV system, and delivers the reference software for disparity estimation [63] and view synthesis [64]. The algorithms in the reference software are respectively called the DERS algorithm and the VSRS algorithm. The team also creates a test bed and a quality evaluation method to assess the performance of 3-D videos. Furthermore, it combines the disparity estimation and view synthesis with the multi-view video coding (MVC) [107] for data compression and transmission. In this section, we introduce the 3DVC’s DERS algorithm and point out its design challenges in the processing of high-resolution videos. In addition, we present the 3DVC’s I/O configuration and quality evaluation method, which are also adopted in this dissertation.

2.3.1 Input and Output View Configuration

The input and output setting is defined by the 3DVC [71] as shown in Figure II-15. In the 2-view configuration, the disparity estimation and view synthesis engines load the original left-view and right-view videos to generate the virtual-view videos. Combining the synthesized video and one of the


configuration. In this configuration, two view videos are synthesized for the stereoscopic display. For the 9-view display, eight virtual-view videos need to be synthesized and combined with the original center-view video. Based on the above configurations, the disparity estimation and view synthesis engines can be directly extended to support free-viewpoint TV if more view videos are available.


Figure II-15 Input and output view configuration defined by the 3DVC

(a) 2-view configuration for stereoscopic display, (b) 3-view configuration for stereoscopic display, (c) 3-view configuration for 9-view display

2.3.2 DERS Algorithm

The depth estimation reference software (DERS) algorithm [63] delivered by the 3DVC is illustrated in Figure II-16. The DERS algorithm uses the image frames of three views to compute the center-view disparity map. In addition, the previous image frame and disparity map are also involved for the temporal consistency enhancement. Note that the DERS algorithm can support input videos without rectification. The steps in the DERS algorithm are introduced in the following.

Legend of Figure II-15: OL, original left-view; OR, original right-view; OC, original center-view; SR, synthesized right-view; DE and VS, disparity estimation and view synthesis.
