National Chiao Tung University
Department of Electronics Engineering and Institute of Electronics
Doctoral Dissertation
The Study of Disparity Estimation Design for High Definition 3DTV Applications
Student: Yu-Cheng Tseng
Advisor: Prof. Tian-Sheuan Chang
A Dissertation
Submitted to Department of Electronics Engineering and
Institute of Electronics
College of Electrical and Computer Engineering
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
in
Electronics Engineering
August 2011
Hsinchu, Taiwan, Republic of China
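The Study of Disparity Estimation Design for High Definition 3DTV Applications
Student: Yu-Cheng Tseng
Advisor: Tian-Sheuan Chang
Doctoral Program, Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University
Abstract (Chinese)
With the advent of 3DTV, people can gain a new visual experience from 3D videos. 3D videos can be captured by stereo cameras and, after image processing, can support multi-view and free-viewpoint 3DTV applications. In 3D video processing, disparity estimation is one of the most important techniques. It generates disparity maps of the captured scene, which can be used to synthesize virtual-view videos. The 3D Video Coding team of the Moving Picture Experts Group has proposed the current state-of-the-art disparity estimation algorithm. This algorithm produces high-quality disparity maps for 3DTV applications, but its graph-cut algorithm leads to high computational complexity and low parallelism. The problem is more severe for high definition videos.
To solve these problems, this dissertation first proposes a baseline disparity estimation algorithm, which adopts the belief propagation algorithm to increase the parallelism of disparity estimation, together with the joint bilateral upsampling algorithm to reduce the frame size to be processed. The hardware design problems it faces are solved by the proposed hardware architecture methods. Based on this baseline algorithm, we further propose a high-quality disparity estimation algorithm that improves temporal consistency, handles occlusion, and produces high-quality disparity maps. For the high-quality algorithm, we propose two fast disparity estimation algorithms suited to different implementation methods. For software programming, the proposed sparse-computation fast algorithm selects sparse pixels through temporal and spatial analysis and updates the disparity values only for those pixels, reducing the execution time to 62.9%. On the other hand, for VLSI design, the proposed hardware-efficient fast algorithm uses a new cost diffusion method to reduce the execution time to 57.2%, and greatly reduces the memory cost of the original algorithm to 0.00029%. The objective evaluation results show that, for virtual-view video synthesis, the proposed algorithms achieve quality close to that of the current state-of-the-art algorithm.
Finally, we simplify the hardware-efficient fast algorithm and propose a high-throughput architectural design. The hardware implementation results show that the proposed disparity estimation engine supports a disparity range of 128, generates HD1080p disparity maps for three views simultaneously, and achieves an output rate of 95 frames per second, that is, 75.64G pixel-disparities per second. In summary, the disparity estimation design proposed in this dissertation satisfies the requirements of high definition 3DTV applications.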
The Study of Disparity Estimation Design for High Definition 3DTV
Applications
Student: Yu-Cheng Tseng
Advisor: Dr. Tian-Sheuan Chang
Department of Electronics Engineering & Institute of Electronics
National Chiao Tung University
ABSTRACT
With the emergence of 3DTVs, viewers can gain a new visual experience from 3D videos, which are
captured by stereo cameras and further processed by image processing techniques for multi-view or
free-viewpoint 3DTV applications. In 3D video processing, one of the most important techniques is
disparity estimation, which generates the disparity maps used to synthesize virtual-view videos.
The state-of-the-art disparity estimation algorithm proposed by the MPEG 3D Video Coding team
delivers high-quality disparity maps, but suffers from high computational complexity and low
parallelism due to its graph-cut algorithm, especially for high definition videos.
To address these problems, this dissertation first proposes a baseline disparity estimation
algorithm that adopts the belief propagation algorithm to increase the parallelism of disparity
estimation, and the joint bilateral upsampling algorithm to reduce the computational resolution.
Their design challenges are solved by our proposed architectural design methods. Based on the
baseline algorithm, we further propose a high-quality algorithm that improves temporal consistency,
handles occlusion, and delivers high-quality disparity maps. To accelerate the high-quality
algorithm, we propose two fast algorithms for different implementation methods. For the software
implementation, the sparse-computation fast algorithm decreases the processed pixels in the spatial
and temporal domains, and reduces the execution time of the high-quality algorithm to 62.9%. On the
other hand, for the hardware implementation, we propose the hardware-efficient fast algorithm, which
reduces the execution time of the high-quality algorithm to 57.2% and decreases the memory cost of
belief propagation to 0.00029% by the proposed cost diffusion method. The objective evaluation
results show that our disparity quality is similar to that of the state-of-the-art algorithm for
view synthesis applications.
Moreover, we further simplify the hardware-efficient algorithm and propose a high-throughput
architectural design. The implementation results show that the proposed disparity estimation engine
achieves a throughput of 95 frames/s for three-view HD1080p disparity maps with 128 disparity
levels (i.e., 75.64G pixel-disparities/s). It could satisfy the requirement of high definition
3DTV applications.
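Acknowledgments
From four years of undergraduate study to five years of graduate school, National Chiao Tung University has given me much growth and many memories. With the completion of this dissertation, I would like to thank everyone. First and foremost, I thank my advisor, Prof. Tian-Sheuan Chang (張添烜), who has patiently guided and advised me in research methods, paper writing, and paper submission, from my junior-year undergraduate project through my doctoral studies. Next, I thank Prof. 王聖智, who supervised 嘉賓 and me in a project on color blindness during my senior year, recommended me for direct admission to the doctoral program, and served on my oral defense committee. I also thank the other committee members, Prof. 楊家輝, Prof. 李鎮宜, Prof. 杭學鳴, Prof. 蔣迪豪, Prof. 蔡宗漢, and Prof. 林嘉文, for taking the time to offer their guidance.
Laboratory 427 in Engineering Building IV is where I spent the most time at the university during my doctoral studies. First, I thank my senior labmate 張彥中, who taught me good research methods and attitude and led me to the topic of this dissertation. I also thank 國龍, my longest-standing labmate, who shared research and daily life with me during five years in the lab as we worked together toward our doctoral degrees. Next, I thank the senior members of the lab, 佑昆, 朝鐘, 君偉, 裕仁, 錦木, 國亘, 旻奇, 子筠, 嘉俊, 英澤, 得瑋, and 秈璟, who taught me the basic concepts of hardware design and created a harmonious atmosphere in the lab. I also thank my labmates 宗憲, 景竹, 瑋呈, and 瑋城; because of you, the lab was always full of laughter. In addition, I thank the junior lab members who worked with me: 之悠, 博淵, 政君, 孟維, 筱珊, 博雄, 奕君, 瑩蓉, 宥辰, 元歆, 英佑, 克嘉, 亮齊, and 孟勳.
Finally, I thank my family and my girlfriend. From the decision to pursue a doctorate, through preparing for the qualifying examination and waiting for journal reviews, to writing and defending this dissertation, your support and companionship along the way enabled me to earn this degree.
This dissertation is dedicated to everyone thanked above.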
Table of Contents
Abstract (Chinese) ... i
ABSTRACT ... iii
Acknowledgments ... v
List of Symbols ... xi
I Introduction ... 1
1.1 Background ... 1
1.2 Motivation ... 1
1.3 Contribution ... 2
1.4 Dissertation Organization ... 3
II Background ... 5
2.1 Disparity Estimation ... 5
2.1.1 Epipolar Geometry ... 5
2.1.2 General Algorithm Flow ... 7
2.2 View Synthesis ... 19
2.2.1 Warping ... 20
2.2.2 Blending ... 20
2.2.3 Hole Filling ... 21
2.3 Review of DERS Algorithm from 3DVC ... 22
2.3.1 Input and Output View Configuration ... 22
2.3.2 DERS Algorithm ... 23
2.3.3 Reference Software for 3-View Configuration ... 26
2.3.4 Evaluation Method for Disparity Quality ... 27
2.3.5 Design Challenges ... 31
2.4 Summary ... 33
III Baseline Disparity Estimation with Belief Propagation and Joint Bilateral Filter for High Definition 3DTV Applications ... 34
3.1 Introduction ... 34
3.1.1 Baseline Belief Propagation ... 34
3.1.2 Joint Bilateral Upsampling ... 36
3.2 Analysis and Design of Baseline Belief Propagation ... 38
3.2.1 Analysis of Belief Propagation ... 38
3.2.2 Proposed Low Memory Cost Access Approach ... 41
3.2.4 Implementation Result ... 54
3.3 Analysis and Design of Joint Bilateral Filtering ... 58
3.3.1 Related Acceleration Approaches ... 58
3.3.2 Analysis of Integral Histogram Approach ... 62
3.3.3 Proposed Memory Reduction Methods ... 66
3.3.4 Proposed Architecture ... 69
3.3.5 Implementation Result ... 75
3.4 Baseline Disparity Estimation Algorithm ... 77
3.4.1 Baseline Algorithm ... 77
3.4.2 Comparison ... 79
3.5 Summary ... 84
IV Advanced Disparity Estimation Algorithms for High Definition 3DTV Applications ... 86
4.1 High-Quality Disparity Estimation Algorithm ... 86
4.1.1 Related Work ... 86
4.1.2 Observation in DERS and Baseline Algorithms ... 89
4.1.3 Proposed Algorithm Flow... 91
4.1.4 Downsampled Disparity Estimation for Full Range Disparity ... 93
4.1.5 Joint Bilateral Upsampling ... 98
4.1.6 Occlusion Handling ... 99
4.1.7 Temporal Consistency Enhancement... 102
4.2 Sparse-Computation Disparity Estimation Algorithm ... 106
4.2.1 Related Work ... 107
4.2.2 Proposed Algorithm Flow... 107
4.2.3 Sparse Pixel Selection ... 111
4.2.4 Sparse-Computation Steps... 115
4.2.5 Computational Reduction ... 116
4.3 Hardware-Efficient Disparity Estimation Algorithm ... 117
4.3.1 Design Challenges in High-Quality Algorithm ... 118
4.3.2 Proposed Algorithm Flow... 121
4.3.3 Cost Diffusion Algorithm ... 122
4.3.4 Image Buffer Reduction Methods ... 126
4.3.5 Small Filter Window Size ... 127
4.3.6 Regular Occlusion Handling ... 128
4.3.7 Simple Region Detection ... 129
4.4 Summary ... 131
V Experimental Results ... 132
5.1 Experiment Setting ... 132
5.1.2 Input and Output Configuration ... 134
5.2 Comparison ... 136
5.2.1 Execution Time ... 136
5.2.2 Objective Quality Evaluation ... 138
5.3 Summary ... 159
VI Design of Disparity Estimation Engine for High Definition 3DTV Applications ... 161
6.1 Architectural Analysis ... 161
6.1.1 Analysis of Hardware-Efficient Disparity Estimation Algorithm... 161
6.1.2 Proposed Hardware-Based Algorithm ... 164
6.2 Overview of Disparity Estimation Engine ... 168
6.2.1 Proposed Three-Stage Pipelining Architecture ... 168
6.2.2 Schedule of Main Core ... 169
6.3 Detailed Architectural Design ... 170
6.3.1 Low-Resolution Disparity Estimation Stage ... 171
6.3.2 Occlusion Handling Stage ... 179
6.3.3 High-Resolution Disparity Estimation Stage ... 184
6.4 External Memory Access ... 189
6.4.1 Bandwidth Requirement ... 189
6.4.2 External Memory Architecture ... 191
6.4.3 Data Configuration in External Memory ... 192
6.4.4 External Memory Access Schedule ... 194
6.5 Implementation Result ... 195
6.5.1 Hardware Cost ... 195
6.5.2 Disparity Quality ... 199
6.6 Summary ... 203
VII Conclusion ... 204
7.1 Contribution ... 204
7.2 Future Work ... 205
Bibliography ... 206
Author's Biography ... 215
List of Tables
Table II-1 Various match metrics for computing C0(x, y, d) ... 10
Table III-1 Comparison of memory cost in memory access approaches for the iteration count of 30 .. 55
Table III-2 Logic cost comparison of PE architectures ... 57
Table III-3 Implementation results of various BP-based algorithms ... 58
Table III-4 Comparison of BF acceleration approach in computational complexity and memory cost ... 59
Table III-5 Computational flow and analysis for a pixel in the integral histogram approach ... 63
Table III-6 Modified computational flow and analysis for a pixel in the integral histogram approach ... 70
Table III-7 Example implementation result of the proposed architecture ... 76
Table III-8 Comparison of hardware cost per frame ... 76
Table III-9 Previous VLSI implementations of bilateral filtering ... 77
Table III-10 Comparison of different implementations ... 77
Table IV-1 Simulation results with different sampling factors in Y-PSNR (dB) ... 94
Table IV-2 Comparison of execution time of HQ-DE and SC-DE algorithms ... 117
Table IV-3 Window sizes of filter-based processes in HQ-DE algorithm ... 120
Table IV-4 Comparison of memory requirement between BP-M and cost diffusion methods ... 124
Table IV-5 Window sizes of filter-based processes in HE-DE algorithm ... 128
Table V-1 Test sequences ... 134
Table V-2 Input and output views for 2-view configuration [71] ... 135
Table V-3 Input and output views for 3-view configuration [71] ... 135
Table V-4 Experiment setting in our evaluation ... 136
Table V-5 Average execution time of proposed algorithms on PC for one frame ... 137
Table V-6 Average execution time scaled to HD1080p resolution and disparity range of 128 ... 137
Table V-7 Evaluation results of Y-PSNR for View0 ... 139
Table V-8 Evaluation results of Y-PSNR for View8 ... 139
Table V-9 Evaluation results of SSIM for View0 ... 140
Table V-10 Evaluation results of SSIM for View8 ... 141
Table V-11 Evaluation results of T_PSPNR (dB) for View0... 142
Table V-12 Evaluation results of T_PSPNR for View8 ... 142
Table VI-1 Estimated average external bandwidth for computing four disparity rows. ... 191
Table VI-2 Performance of the proposed disparity estimation engine ... 196
Table VI-3 Internal SRAM usage in the proposed disparity estimation engine ... 196
Table VI-4 Internal registers in the proposed disparity estimation engine ... 197
Table VI-5 Area of the computational logic ... 197
Table VI-6 Comparison of our design and previous implementation ... 198
Table VI-7 Evaluation results of Y-PSNR for View0 ... 200
Table VI-8 Evaluation results of Y-PSNR for View8 ... 200
Table VI-10 Evaluation results of SSIM for View8 ... 201
Table VI-11 Evaluation results of T_PSPNR (dB) for View0 ... 202
Table VI-12 Evaluation results of T_PSPNR (dB) for View8 ... 202
List of Figures
Figure II-1 Epipolar geometry... 6
Figure II-2 Image planes with rectification ... 6
Figure II-3 Relation between disparity and depth for a pair of correspondences ... 7
Figure II-4 A general framework for disparity estimation algorithms ... 8
Figure II-5 Matching costs of a target pixel and its correspondence candidates ... 9
Figure II-6 Illustration of a cost cube ... 9
Figure II-7 Block-based matching cost with the block radius r ... 10
Figure II-8 Various cost aggregation approaches ... 12
Figure II-9 Concept of dynamic programming approach ... 14
Figure II-10 Graph model of graph-cut algorithm ... 15
Figure II-11 Graph model of belief propagation approach... 16
Figure II-12 General flow of view synthesis ... 19
Figure II-13 Warping methods in view synthesis ... 20
Figure II-14 Blending step in view synthesis ... 21
Figure II-15 Input and output view configuration defined by the 3DVC ... 23
Figure II-16 Flow of the DERS algorithm ... 24
Figure II-17 Data flow for 3-view configuration ... 27
Figure II-18 Example of temporal noise changing successive frames [76] ... 30
Figure II-19 Example of block matching in the DERS algorithm... 32
Figure III-1 Illustrations of BP ... 35
Figure III-2 Configuration of the message passing PEs ... 40
Figure III-3 Traditional fixed memory access approach in a 1-D node line for node n3 computation ... 43
Figure III-4 Proposed spinning-message approach ... 44
Figure III-5 Proposed spinning-message approach in a 2-D node plane for node n3 computation ... 45
Figure III-6 Comparison of memory access approaches in different node planes ... 45
Figure III-7 Sliding node plane in different directions ... 46
Figure III-8 Sliding node plane with the spinning-message approach ... 47
Figure III-9 Bipartite node plane with the spinning-message approach ... 48
Figure III-10 Proposed sliding-bipartite node plane ... 49
Figure III-11 Pseudo code of the message passing for calculating a new message ... 51
Figure III-12 Architecture of Park’s PE ... 51
Figure III-13 Proposed architecture ... 53
Figure III-14 Ratio of memory cost in different node planes with spinning-message approach ... 56
Figure III-15 Classification of BF acceleration approaches ... 59
Figure III-16 Concept of histogram-based approaches ... 61
Figure III-17 Concept of integral histogram approach ... 64
Figure III-19 Stripe-based method (SBM) ... 68
Figure III-20 Sliding origin method (SOM) ... 69
Figure III-21 Proposed architecture of JBF. ... 71
Figure III-22 Schedule of the proposed architecture ... 72
Figure III-23 Selected-bin adder in the histogram calculation engines ... 73
Figure III-24 Proposed architectures of histogram calculation engines hic and hcc ... 73
Figure III-25 Proposed architecture of (a) convolution engine and (b) its table selection modules ... 75
Figure III-26 Flow of the proposed baseline disparity estimation algorithm ... 78
Figure III-27 Experimental results of the baseline algorithm and the DERS algorithm ... 80
Figure III-28 Center disparity maps and synthesized View8 of baseline algorithm at the 100th frame ... 82
Figure III-29 Center disparity maps and synthesized View8 of DERS algorithm at the 100th frame ... 84
Figure IV-1 Flow of the adaptive-BP algorithm [39] ... 87
Figure IV-2 Flow of the double-BP algorithm [40] ... 88
Figure IV-3 An example of flicker artifact of the baseline algorithm in BookArrival ... 90
Figure IV-4 An example of foreground copy artifact of the DERS algorithm in BookArrival ... 90
Figure IV-5 An example of occlusion problem at the 44th frame of BookArrival ... 91
Figure IV-6 Flow of the HQ-DE algorithm for a center-view disparity map ... 92
Figure IV-7 Flow of the HQ-DE algorithm for a side view disparity map ... 93
Figure IV-8 Comparison of different sampling factors in the average Y-PSNR of two frames ... 94
Figure IV-9 Simulation results using the sampling factors of 1/2×1/4 and 1/4×1/4 ... 95
Figure IV-10 Illustration of downsampled disparity estimation for full disparity range ... 96
Figure IV-11 Comparison between the original regional vote [6] and the proposed window vote ... 99
Figure IV-12 Illustration of the proposed occlusion detection method ... 100
Figure IV-13 Results with and without the proposed occlusion handling method in BookArrival ... 102
Figure IV-14 Results of the HQ-DE algorithm in BookArrival compared to Figure IV-5 ... 102
Figure IV-15 Concept of the proposed no-motion registration (NMR) method ... 104
Figure IV-16 Results of the proposed NMR method in BookArrival ... 105
Figure IV-17 Results of the proposed NMR method in the 32nd, 34th, 36th, and 38th frames ... 105
Figure IV-18 Results of the proposed SEP method in BookArrival ... 106
Figure IV-19 Profiling of the HQ-DE algorithm on PC ... 108
Figure IV-20 Flow of the SC-DE algorithm for center-view disparity map ... 109
Figure IV-21 Flow of the SC-DE algorithm for side-view disparity maps ... 111
Figure IV-22 Flow of region detection for sparse pixel selection ... 112
Figure IV-23 Example of edge maps in BookArrival ... 113
Figure IV-24 Example of occlusion maps in BookArrival ... 113
Figure IV-25 Example of motion maps in BookArrival ... 115
Figure IV-26 Concept of sparse SSAD and sparse ADSW methods ... 115
Figure IV-27 Concept of sparse BP-M method ... 116
Figure IV-29 Image buffer required by the SSAD and ADSW steps ... 119
Figure IV-30 Flow of the HE-DE algorithm for center view ... 122
Figure IV-31 Concept of BP-M computation ... 123
Figure IV-32 Concept of the proposed window-based SSAD method ... 127
Figure IV-33 Flow of proposed occlusion handling method in HE-DE algorithm ... 128
Figure IV-34 Flow of edge detection and motion detection in HE-DE algorithm ... 130
Figure V-1 Clips of test sequences in center view... 133
Figure V-2 Evaluation results of Y-PSNR ... 140
Figure V-3 Evaluation results of SSIM ... 141
Figure V-4 Evaluation results of T_PSPNR ... 143
Figure V-5 Disparity maps and view synthesized images in the 50th frame of BookArrival ... 145
Figure V-6 Disparity maps and view synthesized images in the 50th frame of LoveBird1 ... 147
Figure V-7 Disparity maps and view synthesized images in the 100th frame of Newspaper... 149
Figure V-8 Disparity maps and view synthesized images in the 50th frame of Café ... 149
Figure V-9 Disparity maps and view synthesized images in the 50th frame of Kendo ... 150
Figure V-10 Disparity maps and view synthesized images in the 100th frame of Balloons ... 151
Figure V-11 Disparity maps and view synthesized images in the 50th frame of Champagne... 153
Figure V-12 Disparity maps and view synthesized images in the 50th frame of Pantomime ... 155
Figure V-13 Disparity maps and view synthesized images in the 50th frame of Hall1 ... 156
Figure V-14 Disparity maps and view synthesized images in the 50th frame of Hall2 ... 157
Figure V-15 Disparity maps and view synthesized images in the 167th frame of CarPark ... 158
Figure V-16 Disparity maps and view synthesized images in the 50th frame of CarPark ... 159
Figure VI-1 Data dependency of the HE-DE algorithm ... 162
Figure VI-2 Required row buffers in filter-based processes for pipelining architecture ... 163
Figure VI-3 Memory buffers in the motion detection ... 164
Figure VI-4 Flow of the proposed HW-DE algorithm ... 165
Figure VI-5 Proposed motion detection in the HW-DE algorithm ... 167
Figure VI-6 Overview architecture of the proposed disparity estimation engine ... 169
Figure VI-7 Proposed computational schedule for main core ... 170
Figure VI-8 Architecture of the low-resolution disparity estimation stage ... 172
Figure VI-9 Data access of the motion detection module in the frame coordinate system ... 173
Figure VI-10 Architecture of the motion detection module ... 173
Figure VI-11 Input and required data in matching cost calculation for three target views ... 174
Figure VI-12 Architecture of the window-based SSAD and DPotts modules ... 175
Figure VI-13 Architecture of the temporal cost calculation module ... 176
Figure VI-14 Architecture of vertical cost diffusion module ... 177
Figure VI-15 Fully parallel architecture of the horizontal cost diffusion module ... 178
Figure VI-16 Architecture of the horizontal cost diffusion module ... 179
Figure VI-18 Architecture of the disparity cross warping module ... 181
Figure VI-19 Architecture of the occlusion detection PE and the warp filling PE ... 182
Figure VI-20 Architecture of the good disparity detection module ... 183
Figure VI-21 Architecture of border filling and inside filling modules ... 184
Figure VI-22 Architecture of the high-resolution disparity estimation stage ... 185
Figure VI-23 Memory configuration in the high-resolution disparity estimation stage ... 186
Figure VI-24 Architecture of the joint bilateral upsampling module ... 187
Figure VI-25 Architecture of the window vote module ... 188
Figure VI-26 Architecture of the mask and vote PEs for the window vote module ... 188
Figure VI-27 Architecture of the still-edge preservation module ... 189
Figure VI-28 Rough schedule for external memory access ... 190
Figure VI-29 Architecture of external memory in our design ... 191
Figure VI-30 Read and write latency in the SDRAM model [110] ... 192
Figure VI-31 Data configuration in external memory ... 194
Figure VI-32 Schedule of external memory access for one HD1080p frame at 800MHz... 195
Figure VI-33 Evaluation results of Y-PSNR ... 200
Figure VI-34 Evaluation results of SSIM ... 201
List of Symbols
Symbols Descriptions
H, W Frame height, frame width
DR Disparity range
I_{S0,S1}^{S2} Image frame, where S0 is H for high resolution or L for low resolution; S1 is L for left view, C for center view, or R for right view; and S2 is t for the current frame or t-1 for the previous frame
D_{S0,S1}^{S2} Disparity map, where S0 is H for high resolution or L for low resolution; S1 is L for left view, C for center view, or R for right view; and S2 is t for the current frame or t-1 for the previous frame
C0 Initial cost cube computed by the matching cost calculation step
Caggr Cost cube computed by the cost aggregation step
Cview Cost for inter-view consistency constraint
Ctemp Cost for temporal consistency enhancement
Cvert Cost for vertical diffusion method
Ctotal Final Cost for disparity optimization
T Iteration count in belief propagation
D(d) Data term in disparity optimization process
I Introduction
1.1 Background
With the rapid development of 3-D display techniques, people can obtain a new visual experience
from 3-D videos, which provide multi-view videos for the left and right eyes. Compared to
traditional 2-D videos, 3-D videos give viewers a sense of depth in the scene, with the additional
video processes: calibration and rectification, multi-view video coding, disparity estimation, and
virtual view synthesis. For these 3-D video processes, the Moving Picture Experts Group (MPEG)
3-D Video Coding (3DVC) team has delivered a basic 3DTV framework that consists of the depth
estimation reference software (DERS) [63], the view synthesis reference software (VSRS) [64], and
the Multi-view Video Coding (MVC) standard [107]. The team also provides multi-view video sequences
[71] for performance evaluation. The basic 3DTV framework can be extended to various systems,
such as stereoscopic TV for multiple viewers and free-viewpoint TV for a larger viewing zone
[100], [101].
For the basic 3DTV framework, previous VLSI implementations of the VSRS and the MVC decoder
[61], [62] can reach real-time performance for high definition videos. On the other hand, the DERS
delivers high quality disparity maps but suffers from high computational complexity due to its
graph-cut optimization, especially for high definition videos. Therefore, it is necessary to develop
a disparity estimation engine that delivers high quality disparity maps and achieves real-time
performance for high definition videos.
1.2 Motivation
Many disparity estimation algorithms have been developed in computer vision for different
accuracy evaluation [72] shows that the graph-cut and belief propagation approaches perform
better than other kinds of approaches. Based on the graph-cut approach, the state-of-the-art
DERS algorithm delivered by MPEG 3DVC generates high quality disparity maps for 3DTV
applications, but it still encounters the following problems. First, the temporal consistency
problem is not well addressed because of the foreground copy artifact. Second, its execution time
increases dramatically with the video resolution and disparity range: one HD1080p frame takes
more than 20 minutes on average on a personal computer. Third, the computation of graph-cut is
irregular and iterative, so it is not suitable for acceleration by the parallel processing elements
(PEs) of a VLSI design or by a multi-core platform.
Motivated by these problems in the state-of-the-art disparity estimation algorithm, the goal of
this dissertation is to develop a new disparity estimation engine that not only generates high
quality disparity maps, but also achieves a throughput of 60 frames/s at HD1080p resolution to
satisfy the requirement of high definition 3DTV applications.
1.3 Contribution
To achieve the above goal, this dissertation develops a disparity estimation engine from the
algorithm level to the architectural design level. The main achievements of this dissertation
include a baseline and an advanced disparity estimation algorithm, two fast algorithms for the
advanced one, and a high throughput disparity estimation design.
The contributions of each achievement are as follows. First, the baseline disparity estimation
algorithm combines the belief propagation approach, to increase the computational parallelism of
disparity estimation, with the joint bilateral upsampling approach, to decrease the computational
space. In addition, we solve their memory cost problems by architectural design techniques. Second,
based on the baseline algorithm, we propose the advanced disparity estimation algorithm that could
DERS algorithm. Third, we propose two fast disparity estimation algorithms that accelerate the
high-quality algorithm by different strategies for different implementation methods. For the
processor-based platform, the sparse-computation algorithm reduces the original execution time
to 62.9% by reducing the processed pixels from a dense to a sparse space. On the other hand, for
the hardware design, the hardware-efficient algorithm reduces the original memory cost to 0.00029%
by replacing belief propagation with the proposed cost diffusion method. Finally, we propose a
high throughput disparity estimation engine for the hardware-efficient algorithm with a three-stage
row-based pipelining architecture. The dedicated design achieves a throughput of 95 frames/s
for three-view HD1080p disparity maps, using 1,645K gates and 59.4 Kbytes of memory.
In the objective quality evaluation, the experimental results show that our proposed advanced
disparity estimation algorithm performs better than the DERS algorithm, especially in temporal
consistency. In addition, the proposed fast algorithms have similar performance to the advanced
algorithm, and the final hardware design has slight quality degradation because of its
simplification.
To sum up, the proposed disparity estimation design delivers disparity maps with high throughput
and high quality to satisfy the requirement of high definition 3DTV applications.
1.4 Dissertation Organization
This dissertation is organized as follows. Chapter II introduces the general framework of a
disparity estimation algorithm and the existing approaches for each step in the framework.
Chapter III analyzes the algorithm and architecture of belief propagation and joint bilateral
upsampling, and presents the baseline disparity estimation algorithm. To improve the quality and
speed of the baseline algorithm, Chapter IV proposes the high-quality disparity estimation
algorithm and its two fast algorithms: sparse-computation and hardware-efficient. Then, Chapter V
compares the disparity
methods. With the hardware-efficient algorithm, Chapter VI proposes the architecture of the
disparity estimation engine and demonstrates our implementation results. Finally, Chapter VII
concludes this dissertation.
II Background
In this chapter, the background of disparity estimation and its application to view synthesis are
introduced. This chapter is organized as follows. First, we present the concept of disparity estimation,
and review the existing disparity estimation algorithms. Then, we illustrate the view synthesis
technique, depth-image-based rendering (DIBR), which is our target application of disparity
estimation. Finally, we introduce the state-of-the-art disparity estimation algorithm [63] developed by
MPEG 3-D Video Coding (3DVC), and point out its quality and design problems.
2.1 Disparity Estimation
In 3DTV applications, disparity estimation extracts the disparity information from source
videos and generates a disparity map for each frame. The disparity map describes the relative
distance of objects in the scene, and can be further used to generate virtual-view videos. The
approach to disparity estimation depends on the number of input views. The 2-D to 3-D conversion
approach is for traditional single-view videos, while the stereo correspondence approach is for
two-view and multiple-view videos. The former infers the disparity map from various disparity
cues, such as texture, defocus, and vanishing points [102], [103], [104]. The latter finds pairs
of correspondences to compute disparity maps. This dissertation focuses on the stereo
correspondence approach.
2.1.1 Epipolar Geometry
The disparity estimation for multi-view videos can be constrained by the epipolar geometry to
reduce the correspondence search range from 2-D space to 1-D space. Figure II-1 shows the concept
of epipolar geometry in a two-view configuration, in which the object Pb is watched by the target
correspondence candidates with p would be located on the ray from C to Pb, whose projected line in
the reference image plane is called the epipolar line. In other words, the correspondence with p
can be searched on the epipolar line, and the search range is restricted to 1-D space.
Furthermore, the image planes can be rectified and translated into new positions with parallel
epipolar lines, as shown in Figure II-2. The correspondence search range is then on a horizontal
line, instead of an oblique line in the original image plane. In other words, the pair of
correspondences is at the identical y-coordinate in the two views. Thus, the computation of
disparity estimation can be regular in raster-scan order.
Figure II-1 Epipolar geometry
Figure II-2 Image planes with rectification
With the rectified image planes, Figure II-3 shows the relation between depth and disparity for a
pair of correspondences. The two cameras at the viewpoints C and C’ capture the object point Pb
and project it to the pair of correspondences on the epipolar line. The correspondences are located
at the coordinates X and -X’ relative to their camera centers. Given the focal length f and the
baseline B of the cameras, if we can estimate the disparity X-X’, the object depth Z can be
acquired by
$$Z = \frac{f \times B}{X - X'}. \tag{II-1}$$
Therefore, disparity estimation finds the pair of correspondences and uses their x-coordinates to
compute the disparity value or depth value for each pixel.
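As a quick illustration of Eq. (II-1), the following minimal Python sketch converts a disparity into a depth. The function name and the example camera (a focal length of 1000 pixels and a baseline of 0.05 m) are illustrative assumptions, not values taken from this dissertation.

```python
def depth_from_disparity(disparity, focal_length, baseline):
    """Return the depth Z = f * B / disparity for one pixel, as in Eq. (II-1).

    disparity and focal_length are in pixels; Z has the unit of baseline.
    """
    if disparity < 0:
        raise ValueError("disparity must be non-negative")
    if disparity == 0:
        # Zero disparity corresponds to a point infinitely far away.
        return float("inf")
    return focal_length * baseline / disparity


# Hypothetical camera: f = 1000 pixels, B = 0.05 m.
for d in (10, 20, 40):
    print(d, depth_from_disparity(d, 1000.0, 0.05))
```

Because Z is inversely proportional to the disparity, doubling the disparity halves the depth, so near objects have large disparities and far objects have small ones.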
Figure II-3 Relation between disparity and depth for a pair of correspondences
2.1.2 General Algorithm Flow
For disparity estimation algorithms, a general framework is proposed by Scharstein and Szeliski
[105], as shown in Figure II-4. In this framework, two images are captured and rectified as inputs,
and a disparity map is the target result. By this framework, disparity estimation algorithms can be
classified into two categories: the local approach and the global approach [105], [106]. The local
approach consists only of the matching cost calculation and the cost aggregation, and the global
approach additionally performs the optimization process. The final disparity refinement step is an
optional process for computing fractional disparity and other post-processing. The existing
approaches for each step are reviewed as follows.
Figure II-4 A general framework for disparity estimation algorithms
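To make the local approach concrete, the sketch below runs its steps on a single rectified image row: matching cost calculation, cost aggregation, and a winner-take-all disparity selection. It is only one possible instance under stated assumptions: the absolute difference is used as the matching cost, the aggregation is a 1-D box window, the target is taken to be the left view (so the candidate for target position x at disparity d is reference position x - d), and pixels are 8-bit. The function and parameter names are hypothetical.

```python
def local_disparity_row(target, reference, disparity_range, radius=1):
    """Estimate one row of disparities with a simple local approach."""
    width = len(target)
    big = 255  # assumed cost for candidates that fall outside the row

    # Step 1, matching cost calculation: cost[d][x] for every candidate d.
    cost = [[abs(target[x] - reference[x - d]) if x - d >= 0 else big
             for x in range(width)]
            for d in range(disparity_range)]

    # Step 2, cost aggregation: sum the costs in a 1-D window around x.
    aggr = [[sum(row[max(0, x - radius):min(width, x + radius + 1)])
             for x in range(width)]
            for row in cost]

    # Step 3, winner-take-all: keep the disparity with the lowest cost.
    return [min(range(disparity_range), key=lambda d: aggr[d][x])
            for x in range(width)]


# The target row is the reference row shifted right by two pixels.
reference = [12, 40, 7, 90, 33, 61, 5, 78]
target = [0, 0, 12, 40, 7, 90, 33, 61]
print(local_disparity_row(target, reference, disparity_range=4))
```

In this toy input the pixels away from the left border recover the shift of 2, while the first few pixels are unreliable because their true correspondences lie outside the row. A global approach would replace step 3 with an optimization over the whole cost cube, such as graph-cut or belief propagation.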
1. Matching Cost Calculation
Matching cost is a quantitative dissimilarity measure used to find the best pair of correspondences. Figure II-5 shows the concept of the matching cost calculation. A target pixel has multiple reference pixels as correspondence candidates, and each correspondence candidate has a matching cost. The number of correspondence candidates is equal to the disparity range DR, which is determined by the nearest and farthest objects in the scene. Hence, each target pixel has DR matching costs. To determine a whole disparity map, the matching costs of all target pixels are calculated and form a disparity space image (DSI), which is called the cost cube in this dissertation. As shown in Figure II-6, a cost cube contains the spatial dimensions x, y and the disparity dimension d. The size of this cube for a whole frame is H×W×DR, where H and W are the frame height and width. The initial values of the cost cube are computed by the matching cost calculation.
[Figure II-4 blocks: Target View and Reference View → Matching Cost Calculation → Cost Aggregation → Disparity Selection/Optimization → Disparity Refinement → Target-View Disparity Map]
Figure II-5 Matching costs of a target pixel and its correspondence candidates
Figure II-6 Illustration of a cost cube
To compute the initial cost cube C0, one of the various match metrics [105]-[3] can be adopted. Table II-1 lists the commonly used match metrics, which can be classified into pixel-based and block-based metrics. For the pixel-based match metrics, the absolute difference (AD) and the square difference (SD) are computed using a target pixel and a reference pixel. The pixel dissimilarity measure (PDM) additionally considers the half pixels to lessen the sampling sensitivity [1].
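The construction of the initial cost cube can be sketched as follows with the AD metric of Table II-1. This is an illustration only: the images are plain nested lists of gray levels, the cube is indexed as cube[y][x][d], and the function name and the cost assigned to candidates that fall outside the frame (out_of_range_cost) are assumptions that are not part of the original text.

```python
def ad_cost_cube(target, reference, disparity_range, out_of_range_cost=255):
    """Build C0[y][x][d] = |I_tar(x, y) - I_ref(x - d, y)| for d in [0, DR)."""
    height, width = len(target), len(target[0])
    cube = [[[out_of_range_cost] * disparity_range for _ in range(width)]
            for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for d in range(disparity_range):
                if x - d >= 0:  # the candidate must lie inside the frame
                    cube[y][x][d] = abs(target[y][x] - reference[y][x - d])
    return cube


# One-row example in which the true disparity is 1 for every valid pixel.
target = [[10, 20, 30, 40]]
reference = [[20, 30, 40, 50]]
cube = ad_cost_cube(target, reference, disparity_range=3)
print(cube[0][2])  # costs of pixel x = 2 for d = 0, 1, 2
```

The three nested loops make the H×W×DR size of the cost cube explicit.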
On the other hand, the block-based match metrics are computed using a target block and a reference block with support pixels, as illustrated in Figure II-7. In Table II-1, the normalized cross correlation (NCC) is a statistical method that uses the block mean and variance to reduce the sensitivity to radiometric gain and bias. The Rank transform converts the pixel color into a rank value, which is the relative order of the center pixel in the block, and computes the matching cost by the rank difference. The Census transform converts the pixel intensity into a census bit stream, which consists of the intensity comparison results between the center pixel and the support pixels. The matching cost of two census bit streams is computed by the Hamming distance. Because the Rank and Census transforms map the original pixels from the color domain to different domains, they can better resist the radiometric distortion between views.
To sum up, the initial cost cube C0 is computed in this matching cost calculation step, and the computational complexity of this step is O(H×W×DR).
Figure II-7 Block-based matching cost with the block radius r
Table II-1 Various match metrics for computing C0(x, y, d)
Pixel-based metrics:
  Absolute Difference (AD): $|I_{tar}(x,y) - I_{ref}(x-d,y)|$
  Square Difference (SD): $[I_{tar}(x,y) - I_{ref}(x-d,y)]^2$
  Pixel Dissimilarity Measure (PDM): $\min\{|I_{tar}(x,y) - I_{ref}(x-d,y)|,\ |I_{tar}(x,y) - I_{ref}^{+}|,\ |I_{tar}(x,y) - I_{ref}^{-}|\}$, where $I_{ref}^{+}$ and $I_{ref}^{-}$ are the neighboring half pixels of $I_{ref}(x-d,y)$
Block-based metrics:
  Normalized Cross Correlation (NCC): $\dfrac{\sum_{|x-u|\le r,\,|y-v|\le r} [I_{tar}(u,v)-\bar{I}_{tar}]\,[I_{ref}(u-d,v)-\bar{I}_{ref}]}{\sqrt{\sum_{|x-u|\le r,\,|y-v|\le r} [I_{tar}(u,v)-\bar{I}_{tar}]^2\,[I_{ref}(u-d,v)-\bar{I}_{ref}]^2}}$
  Rank: $|I'_{tar}(x,y) - I'_{ref}(x-d,y)|$, where $I'(m,n) = \sum_{|m-u|\le r,\,|n-v|\le r} \big(I(m,n) > I(u,v)\big)$
  Census: $Hamming\big(I'_{tar}(x,y),\ I'_{ref}(x-d,y)\big)$, where $I'(m,n) = bitstream_{|m-u|\le r,\,|n-v|\le r}\big(I(m,n) > I(u,v)\big)$
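To illustrate why the Census metric in Table II-1 resists radiometric gain and bias, the following sketch computes the census bit stream of a center pixel and the Hamming distance between two bit streams. The function names are hypothetical, the images are nested lists of gray levels, and the window is assumed to lie entirely inside the image.

```python
def census_transform(image, x, y, radius):
    """Bit list comparing the center pixel I(x, y) with each support pixel."""
    center = image[y][x]
    bits = []
    for v in range(y - radius, y + radius + 1):
        for u in range(x - radius, x + radius + 1):
            if (u, v) == (x, y):
                continue  # the center pixel is not compared with itself
            bits.append(1 if center > image[v][u] else 0)
    return bits


def hamming(bits_a, bits_b):
    """Number of positions at which two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


target = [[1, 5, 3], [4, 6, 2], [9, 7, 8]]
# Same block seen with a hypothetical gain of 2 and bias of 10.
brighter = [[2 * p + 10 for p in row] for row in target]
census_a = census_transform(target, 1, 1, radius=1)
census_b = census_transform(brighter, 1, 1, radius=1)
print(census_a, hamming(census_a, census_b))
```

Because only the ordering of intensities is encoded, the gain and bias leave the bit stream unchanged and the matching cost stays at zero.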
2. Cost Aggregation
The main idea of the cost aggregation step is to gather the costs of the neighboring pixels in a window to the center pixel. It assumes that the neighboring pixels have the same disparity as the center pixel, so gathering the costs of the neighbors can increase the reliability of the matching cost. Thus, the cost aggregation step accumulates the neighboring costs for the center pixel by the general equation,
$$C_{aggr}(x,y,d) = \frac{\sum_{(u,v)\in win(x,y)} C_0(u,v,d)\cdot W_{aggr}(u,v)}{\sum_{(u,v)\in win(x,y)} W_{aggr}(u,v)} , \tag{II-2}$$
where C0 is the initial cost cube, and Caggr is the aggregated cost cube. In this equation, each initial cost C0(u, v, d) in an aggregation window with radius r is accumulated with the weight Waggr(u, v) for the target cost Caggr(x, y, d). In addition, the accumulated value is normalized by the sum of weights. The computational complexity of this step is O(H×W×DR×r²), which is proportional to the aggregation window size.
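Eq. (II-2) can be sketched for one cost entry as follows. The cube layout cube[y][x][d], the function name, the default uniform weight, and the clipping of the window at the frame border are assumptions made for this illustration.

```python
def aggregate_cost(cube, x, y, d, radius, weight=lambda u, v: 1.0):
    """Weighted, normalized sum of C0(u, v, d) over the window around (x, y)."""
    height, width = len(cube), len(cube[0])
    weighted_sum = 0.0
    weight_sum = 0.0
    # The window is clipped at the frame border (an assumption of this sketch).
    for v in range(max(0, y - radius), min(height, y + radius + 1)):
        for u in range(max(0, x - radius), min(width, x + radius + 1)):
            w = weight(u, v)
            weighted_sum += cube[v][u][d] * w
            weight_sum += w
    return weighted_sum / weight_sum


# A 3x3 frame with a single disparity candidate and costs 1..9.
cube = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
print(aggregate_cost(cube, 1, 1, 0, radius=1))
print(aggregate_cost(cube, 0, 0, 0, radius=1))
```

Other weight distributions only change the weight argument; the normalization by the sum of weights stays the same.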
Figure II-8 shows various existing cost aggregation approaches with different weight distributions. In Figure II-8 (a), the uniform weight approach assigns a constant weight to every support pixel and uses a fixed r. Its disparity map would be over-blurred for thin objects if r is too large, while it would be incorrect for textureless regions if r is too small. Therefore, for better disparity quality, the radius of the uniform weight needs to be adaptively adjusted according to the image content, as shown in Figure II-8 (b). Another commonly used approach is the Gaussian weight, shown in Figure II-8 (c), which gives the pixels near the window center higher weights. However, these three approaches cannot obtain accurate disparities due to their fixed window shape (i.e., square or circle).
To control the window shape, the adaptive polygon weight approach [4], [5] uses an 8-direction or 4-direction configuration to fit the object shape, as shown in Figure II-8 (d). The cross-based weight approach [6] uses multiple cross lines to fit the object shape, as shown in Figure II-8 (e). In both approaches, a support region grows from the window center until its boundary touches a dissimilar pixel. However, the two approaches do not perform well for highly textured regions because their support regions must be continuous.
The adaptive support-weight (ADSW) approach [7] avoids this problem, because all support pixels are considered and their weights are determined by the kernels of the bilateral filter. Its weight is defined as
where Wtar is the weight from the target-view window, and Wref is the weight from the reference-view window. Both weights Wtar and Wref are computed by the kernels of the bilateral filter,
$$W(u,v) = f(\|(x,y)-(u,v)\|)\, g(\|I(x,y)-I(u,v)\|) , \tag{II-4}$$
where f is the spatial kernel applied to the position distance, and g is the range kernel applied to the color distance. With the two kernels, the aggregation weight is larger when the support pixel is nearer to the center pixel or more similar to the center pixel. Figure II-8 (f) illustrates the adaptive support-weight, in which the aggregation weight fits the object shape better than the adaptive polygon weight and cross-based weight approaches for highly textured regions. However, the main disadvantage of the ADSW approach is its high computational complexity. Nevertheless, this can be addressed by the integral histogram approach [8], the iterative aggregation with small window approach [9], and the data reuse approach in VLSI design [10].
In summary, the cost aggregation step processes the initial cost cube C0 into a more reliable cost cube Caggr by the well-defined weights.
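The bilateral weight of Eq. (II-4) can be sketched as below. The text does not specify the kernels f and g, so Gaussian kernels are assumed here, and the two sigma parameters, the gray-level image, and the function name are hypothetical.

```python
import math


def bilateral_weight(image, x, y, u, v, sigma_spatial=5.0, sigma_range=10.0):
    """W(u, v) = f(position distance) * g(intensity distance), Gaussian kernels assumed."""
    spatial_distance = math.hypot(x - u, y - v)
    range_distance = abs(image[y][x] - image[v][u])
    f = math.exp(-(spatial_distance ** 2) / (2 * sigma_spatial ** 2))
    g = math.exp(-(range_distance ** 2) / (2 * sigma_range ** 2))
    return f * g


# A vertical edge: the two left columns are dark and the right column is bright.
image = [[10, 10, 200] for _ in range(3)]
weight_same_side = bilateral_weight(image, 1, 1, 0, 1)
weight_across_edge = bilateral_weight(image, 1, 1, 2, 1)
print(weight_same_side, weight_across_edge)
```

Both support pixels are at the same distance from the center, but the one across the edge receives a much smaller weight, which is how the support weight follows the object shape.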
Figure II-8 Various cost aggregation approaches
(a) uniform weight, (b) uniform weight with adaptive window radius, (c) Gaussian weight, (d) adaptive polygon weight, (e) cross-based weight, (f) adaptive support-weight.
3. Disparity Selection/Optimization
With the aggregated cost cube Caggr, one of two methods can be applied to compute the disparity map. One is the winner-take-all (WTA) method, which directly determines the disparity result by selecting the reference pixel with the minimum cost as the best correspondence for each target pixel. The other is the disparity optimization method, which considers the aggregated costs of the whole frame to compute the disparity map by energy minimization. The latter can acquire more accurate disparity maps, as shown in the evaluation results [72].
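The WTA selection can be sketched in a few lines. The cube layout cube[y][x][d] and the function name are assumptions of this illustration; ties are resolved in favor of the smallest disparity.

```python
def winner_take_all(cube):
    """For every pixel, return the disparity index with the minimum cost."""
    return [[min(range(len(costs)), key=costs.__getitem__) for costs in row]
            for row in cube]


# A 2x2 frame with three disparity candidates per pixel.
cube = [[[5, 1, 3], [0, 2, 2]],
        [[4, 4, 1], [7, 6, 6]]]
print(winner_take_all(cube))
```

Each pixel is decided independently of its neighbors, which is what distinguishes WTA from the optimization methods.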
The commonly used disparity optimization approaches are dynamic programming (DP), graph-cut (GC), and belief propagation (BP). Their main concept is to convert the disparity estimation problem into an energy minimization problem. The energy function is generally formulated as
$$E(\mathbf{d}) = E_{data}(\mathbf{d}) + \lambda E_{smooth}(\mathbf{d}) , \tag{II-5}$$
where Edata is the data term that penalizes the dissimilarity of a correspondence pair, Esmooth is the smoothness term that penalizes the disparity inconsistency of two neighboring pixels, and λ is the weighting factor between the two terms. In addition, d is a selected disparity set for the whole frame. The optimization approaches attempt to find the disparity set d that minimizes the total energy E.
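As a sketch of Eq. (II-5), the following function evaluates the total energy of a given disparity map. The data term sums the selected entries of a cost cube indexed as cube[y][x][d]; the smoothness term is assumed here to be the absolute disparity difference between 4-connected neighbors, which is one common choice and is not specified in the text. The function name is hypothetical.

```python
def total_energy(cube, disparity_map, smoothness_weight):
    """E(d) = E_data(d) + lambda * E_smooth(d) for one disparity map."""
    height, width = len(disparity_map), len(disparity_map[0])
    data_term = 0
    smooth_term = 0
    for y in range(height):
        for x in range(width):
            d = disparity_map[y][x]
            data_term += cube[y][x][d]
            if x + 1 < width:   # right neighbor
                smooth_term += abs(d - disparity_map[y][x + 1])
            if y + 1 < height:  # lower neighbor
                smooth_term += abs(d - disparity_map[y + 1][x])
    return data_term + smoothness_weight * smooth_term


# One row, three pixels, two disparity candidates per pixel.
cube = [[[0, 5], [3, 2], [0, 5]]]
print(total_energy(cube, [[0, 1, 0]], smoothness_weight=2))  # per-pixel minimum costs
print(total_energy(cube, [[0, 0, 0]], smoothness_weight=2))  # smooth alternative
```

In this toy case the smooth map has the lower total energy even though its data term is higher, which is the trade-off that the optimization approaches exploit.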
The concepts of the commonly used optimization approaches are reviewed as follows.
(1) Dynamic Programming
The main idea of the DP approach is to convert disparity estimation into a shortest-path problem. The optimization process is performed row by row. Figure II-9 (a) shows the graph model for the shortest-path problem, in which the position of a node corresponds to a coordinate in the x-d plane, and the shortest path runs from x = 0 to x = W-1. The path incurs a matching penalty on each node and a smoothness penalty on each edge. The DP approach finds the path with minimum penalty in two steps: forward accumulating and backward tracing. In Figure II-9 (b), the first step accumulates the penalty in the forward direction to select the moving direction for each node. In Figure II-9 (c), with the moving direction map, the second step traces the path with minimum penalty.
However, the DP approach suffers from streak artifacts in the disparity map because of its row-by-row process. To address this problem, Ohta and Kanade [11] perform the DP in a 3-D space that consists of the original intra-scanline space and an additional inter-scanline space. In addition, the tree-based DP algorithms [12]-[14] use a tree structure to connect the scanlines and remove the streak artifacts.
Figure II-9 Concept of dynamic programming approach
(a) graph model in DP approach, (b) forward accumulating, (c) backward tracing
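The forward accumulating and backward tracing steps of Figure II-9 can be sketched for one scanline as follows. The slice of the cost cube is indexed as cost_slice[x][d]; the smoothness penalty is assumed to be smoothness_weight × |d − d'| with every transition allowed, which is a simplification chosen for this illustration, and the function name is hypothetical.

```python
def scanline_dp(cost_slice, smoothness_weight):
    """Return the minimum-penalty disparity path for one row."""
    width, disparity_range = len(cost_slice), len(cost_slice[0])
    accumulated = [list(cost_slice[0])]
    direction = [[0] * disparity_range]  # best previous disparity of each node

    # Forward accumulating: keep the cheapest predecessor of every node.
    for x in range(1, width):
        row_cost, row_direction = [], []
        for d in range(disparity_range):
            best_prev = min(
                range(disparity_range),
                key=lambda p: accumulated[x - 1][p] + smoothness_weight * abs(d - p),
            )
            row_cost.append(cost_slice[x][d] + accumulated[x - 1][best_prev]
                            + smoothness_weight * abs(d - best_prev))
            row_direction.append(best_prev)
        accumulated.append(row_cost)
        direction.append(row_direction)

    # Backward tracing: start at the cheapest end node and follow the directions.
    d = min(range(disparity_range), key=accumulated[-1].__getitem__)
    path = [d]
    for x in range(width - 1, 0, -1):
        d = direction[x][d]
        path.append(d)
    path.reverse()
    return path


cost_slice = [[0, 5], [3, 2], [0, 5]]
print(scanline_dp(cost_slice, smoothness_weight=2))
print(scanline_dp(cost_slice, smoothness_weight=0))
```

With a zero smoothness weight the path reduces to the per-pixel minimum cost, while a positive weight removes the isolated disparity jump in the middle pixel.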
(2) Graph-Cut
The main idea of the GC approach is to convert the disparity selection problem into the min-cut/max-flow problem [15], so that the associated optimization techniques can be adopted. The GC approach can generate accurate disparity maps.
Figure II-10 shows the graph model of min-cut/max-flow for disparity estimation, in which there are H×W×DR nodes arranged in a 6-connected node grid. The matching cost and the smoothness cost are defined on each edge, and the edges can be regarded as pipes with different flow capacities due to their different costs. In this graph model, water flows from the source node to the sink node through the pipes. The min-cut is the cut surface across the edges that has the minimum total flow capacity, while the max-flow is the maximum flow allowed from the source to the sink. The min-cut and the max-flow are equivalent problems. For disparity estimation, the disparity map can be directly determined according to the resultant cut surface.
Figure II-10 Graph model of graph-cut algorithm
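The pipe analogy can be made concrete on a toy network. The sketch below uses a generic breadth-first augmenting-path method (in the style of Edmonds and Karp), which is not the specific algorithm of any cited work; the four-node network, its capacities, and the function name are hypothetical and far smaller than the H×W×DR grid of Figure II-10.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Maximum flow by repeatedly augmenting along shortest residual paths."""
    node_count = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # Breadth-first search for a path with remaining capacity.
        parent = [-1] * node_count
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            node = queue.popleft()
            for nxt in range(node_count):
                if parent[nxt] == -1 and residual[node][nxt] > 0:
                    parent[nxt] = node
                    queue.append(nxt)
        if parent[sink] == -1:
            return flow  # no augmenting path is left
        # Find the bottleneck of the path, then push that much flow along it.
        bottleneck = float("inf")
        node = sink
        while node != source:
            bottleneck = min(bottleneck, residual[parent[node]][node])
            node = parent[node]
        node = sink
        while node != source:
            residual[parent[node]][node] -= bottleneck
            residual[node][parent[node]] += bottleneck
            node = parent[node]
        flow += bottleneck


# Node 0 is the source and node 3 is the sink; capacity[i][j] is the pipe i -> j.
capacity = [[0, 3, 2, 0],
            [0, 0, 1, 2],
            [0, 0, 0, 3],
            [0, 0, 0, 0]]
print(max_flow(capacity, source=0, sink=3))
```

The returned flow equals the capacity of the cheapest cut, here the two pipes leaving the source, which illustrates the equivalence of the min-cut and max-flow problems.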
For the min-cut/max-flow problem, the commonly used optimization techniques are push-relabel [16] and augmenting path [17]. Their computational complexity is highly related to the number of label candidates (i.e., the disparity range DR in disparity estimation). Consequently, these optimization techniques suffer from extremely high computational complexity when the disparity range is large.
To reduce the computational complexity, Boykov proposed the swap method [18] and an efficient augmenting path [19]. The swap method performs the optimization process disparity by disparity, and only one new disparity is considered in an iteration. Based on the swap method, Chou et al. [20] proposed a fast algorithm that predicts the disparities to skip part of the optimization process. On the other hand, for the push-relabel approach, the computational speed depends on the processing order of the nodes. Thus, Cherkassky and Goldberg [21] proposed the highest-label order, which is more efficient than the typical FIFO order. In addition, Delong and Boykov [22] proposed a block-based graph cut method to increase the parallelism of the push-relabel approach.
To sum up, the GC approach can produce accurate disparity results but is not well suited to acceleration by GPU programming and VLSI design due to its irregular computation and low parallelism.
(3) Belief Propagation
Sun et al. [24] first applied the BP approach to the disparity estimation problem and acquired accurate disparity maps. They perform the energy minimization on the graph model shown in Figure II-11, in which each node corresponds to a pixel and all nodes are connected in a 4-connected grid. In the optimization process, the matching costs of each node are diffused to the neighboring nodes through messages, iteration by iteration. This diffusion mechanism is called message passing. After several iterations, the matching costs and messages of a node are aggregated to determine the disparity result. Although the energy minimization is not guaranteed to converge because of the loops in the graph, the disparity maps can approach a steady state.
Figure II-11 Graph model of belief propagation approach
In the BP approach, message passing has the highest computational complexity, O(H×W×DR²×T), where T is the iteration count. The DR² term results from the convolution, and the iteration count T should be more than 10. To reduce the computation of message passing, Felzenszwalb and Huttenlocher [25] proposed the hierarchical BP (HBP) and the linear-time message passing. The former accelerates the convergence of the disparities, and the latter reduces the complexity of the convolution from O(DR²) to O(DR). In addition, Szeliski et al. [26] proposed the max-product loopy belief propagation, called BP-M, to reduce the iteration count by a scale. Because the computation of the BP approach is highly parallel, the BP approach is suitable for acceleration by GPU programming and VLSI design [27]-[33].
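The reduction from O(DR²) to O(DR) can be illustrated for a single message. The sketch assumes a min-sum formulation with a linear smoothness cost c·|d − p|, for which a forward pass and a backward pass give the same result as the exhaustive minimization. Here h stands for the per-disparity sum of the matching cost and the incoming messages of a node; the function names and the values are hypothetical.

```python
def message_quadratic(h, c):
    """m(d) = min over p of h(p) + c * |d - p|, evaluated exhaustively: O(DR^2)."""
    n = len(h)
    return [min(h[p] + c * abs(d - p) for p in range(n)) for d in range(n)]


def message_linear(h, c):
    """Same message computed with one forward and one backward pass: O(DR)."""
    m = list(h)
    for d in range(1, len(m)):           # forward pass
        m[d] = min(m[d], m[d - 1] + c)
    for d in range(len(m) - 2, -1, -1):  # backward pass
        m[d] = min(m[d], m[d + 1] + c)
    return m


h = [4, 1, 7, 3, 9]
print(message_quadratic(h, c=2))
print(message_linear(h, c=2))
```

Both functions return the same message, but the second touches each disparity candidate only twice.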
In addition, the BP approach also suffers from a high memory cost, 4×H×W×DR, for the matching costs and messages of the whole frame. To address this, the bipartite grid [25] and the sliding approach [34] have been proposed for the memory access, and the predictive coding scheme [35] can be applied for message compression.
To sum up, the above disparity optimization algorithms have different pros and cons. The DP approach can reach real-time speed more easily but has streak artifacts, and its improved methods result in additional irregular computation. For the 2-D optimization approaches, the GC approach delivers high-quality disparity maps, but its irregular computation limits the acceleration by GPU programming and VLSI design. On the other hand, the BP approach can also deliver accurate disparity maps and has high parallelism. Therefore, this dissertation develops an efficient disparity estimation algorithm based on the BP approach.
4. Disparity Refinement
The final step refines the disparity maps by post-processing methods: occlusion handling, object consistency enhancement, and temporal consistency enhancement. Their purposes and associated algorithms are reviewed as follows.
(1) Occlusion Handling
The occlusion problem arises when an object point is visible in one view and invisible in the other view. Thus, there is no corresponding pixel in the view where the point is invisible. Incorrect disparities would appear in the occlusion regions and further induce artifacts in the view synthesis.
To handle the occlusion problem, the general approach is to detect the occlusion first and then fill it with the background disparities. These two steps are called occlusion detection and occlusion filling. The basic methods for occlusion detection are surveyed in [45]. Various methods have different assumptions. The left-right check (LRC) assumes that a pair of correspondences should have the same disparity, and the occlusion constraint (OCC) assumes that a disparity gap between two pixels results in an occlusion region in the other view. In addition, the order constraint (ORD) assumes that two pixels should have correspondences in the same order in the other view. Among the above occlusion detection methods, the LRC is the most commonly applied for the disparity refinement [6], [40], and the OCC and the ORD are combined into the disparity optimization step [15], [24]. With the detected occlusion pixels, the occlusion filling step can directly replace them with the reliable background disparities.
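Occlusion detection and filling can be sketched as follows, assuming that LRC denotes the usual left-right consistency check: a target pixel at x with disparity d is compared with the reference-view disparity at x − d. The filling rule used here, taking the smaller disparity of the nearest valid neighbors on the same row as the background, is one simple heuristic and not necessarily the rule of the cited works. All names and the tolerance parameter are hypothetical.

```python
def left_right_check(disp_target, disp_reference, tolerance=0):
    """Flag target pixels whose disparity disagrees with the reference view."""
    height, width = len(disp_target), len(disp_target[0])
    occluded = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            d = disp_target[y][x]
            xr = x - d  # position of the correspondence in the reference view
            if xr < 0 or abs(disp_reference[y][xr] - d) > tolerance:
                occluded[y][x] = True
    return occluded


def fill_with_background(disparity, occluded):
    """Replace flagged pixels by the smaller disparity of the nearest valid neighbors."""
    filled = [row[:] for row in disparity]
    for y, row in enumerate(disparity):
        width = len(row)
        for x in range(width):
            if not occluded[y][x]:
                continue
            left = next((row[i] for i in range(x - 1, -1, -1) if not occluded[y][i]), None)
            right = next((row[i] for i in range(x + 1, width) if not occluded[y][i]), None)
            candidates = [c for c in (left, right) if c is not None]
            if candidates:
                filled[y][x] = min(candidates)  # smaller disparity = farther = background
    return filled


disp_target = [[1, 1, 1, 3, 1, 1]]
disp_reference = [[1, 1, 1, 1, 1, 1]]
occluded = left_right_check(disp_target, disp_reference)
print(occluded)
print(fill_with_background(disp_target, occluded))
```

The pixel at x = 0 is flagged because its correspondence falls outside the reference frame, and the pixel at x = 3 is flagged because the two views disagree.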
(2) Object Consistency Enhancement
For an object, the disparities are usually identical or smoothly changing. However, disparity maps often suffer from incorrect disparities, especially in the textureless regions. To remove the disparity noise, the plane fitting approach [46] is usually adopted by the high-performance disparity estimation algorithms [63], [39], [40]. In the plane fitting approach, the segment information is first computed by watershed segmentation, mean-shift clustering, or K-means clustering. According to the segment information, the disparities in a segment are used to compute a new 3-D plane by the linear regression method. Besides the plane fitting method, the regional voting method [6] can also refine the disparity maps well. The regional voting method is simpler than the plane fitting method because the segment information is not required.
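The linear regression step of plane fitting can be sketched as an ordinary least-squares fit of d = a·x + b·y + c to the (x, y, d) samples of one segment. The segmentation itself is not shown, the normal equations are solved with Cramer's rule only to keep the example self-contained, and the function names and sample values are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_disparity_plane(samples):
    """Least-squares coefficients (a, b, c) of d = a*x + b*y + c."""
    n = len(samples)
    sx = sum(x for x, _, _ in samples)
    sy = sum(y for _, y, _ in samples)
    sd = sum(d for _, _, d in samples)
    sxx = sum(x * x for x, _, _ in samples)
    syy = sum(y * y for _, y, _ in samples)
    sxy = sum(x * y for x, y, _ in samples)
    sxd = sum(x * d for x, _, d in samples)
    syd = sum(y * d for _, y, d in samples)
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxd, syd, sd]
    det = det3(normal)
    if det == 0:
        raise ValueError("samples are degenerate (for example, collinear)")
    coefficients = []
    for column in range(3):
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][column] = rhs[r]
        coefficients.append(det3(replaced) / det)
    return tuple(coefficients)


# Four pixels of one hypothetical segment lying on the plane d = 0.5 * x + 2.
segment = [(0, 0, 2), (2, 0, 3), (0, 2, 2), (2, 2, 3)]
print(fit_disparity_plane(segment))
```

After the fit, every pixel of the segment can be assigned the disparity a·x + b·y + c, which removes isolated disparity noise inside the segment.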
(3) Temporal Consistency Enhancement
Most research develops disparity estimation algorithms using still image sequences [72]. However, such work misses the temporal consistency issue, which is important in the view synthesis application for video sequences. Without enhancing the temporal consistency, the disparity maps suffer from flicker artifacts, because each disparity frame is generated independently and the disparities are unstable in the occlusion and textureless regions. The flicker artifacts further propagate to the view synthesis results and are easily observed.
To address the temporal consistency, the neighboring frames should be considered in the disparity
flow with the spatial and temporal dimensions, and different smooth approaches are performed in the
disparity flow. On the other hand, with two adjacent frames, the temporal BP algorithm [41] performs the BP optimization in a 6-connected grid graph, where the two additional connections link to the previous and next frames. In addition, the 3DVC’s DERS algorithm [65]-[67] adds a temporal cost to the matching cost according to the previous disparity.
In summary, the disparity refinement step could fix the inconsistent disparities well, and improve
the view synthesis quality for 3DTV applications.
2.2 View Synthesis
In 3DTV applications, view synthesis is one of the most important components to synthesize a
single or multiple virtual view videos for the stereoscopic TV or the free-viewpoint TV [101]. A
common approach for view synthesis is the depth-image-based rendering (DIBR) algorithm [51]-[57],
which can warp a video to another view according to disparity maps.
Figure II-12 General flow of view synthesis
A general DIBR algorithm can be divided into three steps: warping, blending, and hole filling, as depicted in Figure II-12. For different numbers of input views, the DIBR algorithm faces different challenges in its steps. With single-view input, the DIBR algorithm suffers from large occlusion holes in the hole filling step, while with multiple-view inputs, it suffers from inconsistent warped pixels in the blending step. The concept and challenges of each step are presented in the following.
2.2.1 Warping
In Figure II-12, the warping step loads the textures and disparities of the reference side-views to generate the warped textures and hole maps of the target center-view. In the warping step, the reference textures are shifted to the target view according to the reference disparity maps.
The methods of the warping step can be classified into one-step warping and two-step warping, as illustrated in Figure II-13. The one-step warping directly warps the reference textures to the target view according to the warping positions given by the disparities, while the two-step warping first warps the target disparity and then uses it to synthesize the target texture. Rogmans et al. [58] and Morvan [59] show that the two-step warping performs better because its sampling precision is higher.
Figure II-13 Warping methods in view synthesis (a) one-step warping, (b) two-step warping
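One-step warping of a single scanline can be sketched as follows. The shift direction (toward smaller x), the scale of 0.5 for a target view halfway between two reference views, and the truncation to integer positions are assumptions of this illustration; the function name is hypothetical.

```python
def warp_row(texture_row, disparity_row, scale=0.5):
    """Shift each reference pixel by scale * disparity and record the holes."""
    width = len(texture_row)
    warped = [None] * width
    stored_disparity = [-1] * width  # disparity of the pixel currently stored
    for x in range(width):
        d = disparity_row[x]
        target_x = x - int(scale * d)  # integer position in the target view
        # A larger disparity means a nearer object, so it wins a collision.
        if 0 <= target_x < width and d > stored_disparity[target_x]:
            warped[target_x] = texture_row[x]
            stored_disparity[target_x] = d
    hole_map = [value is None for value in warped]
    return warped, hole_map


texture_row = ["a", "b", "c", "d", "e", "f"]
disparity_row = [0, 0, 4, 4, 0, 0]  # pixels "c" and "d" belong to a near object
warped, hole_map = warp_row(texture_row, disparity_row)
print(warped)
print(hole_map)
```

The near object moves by two pixels and covers the background pixels it lands on, while the positions it leaves behind become holes in the hole map.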
2.2.2 Blending
With multi-view inputs, the warping step generates multiple textures for the target view, as shown in Figure II-12. In other words, there are multiple warped pixels for a target position. However, the colors of these warped pixels are not consistent due to the different radiometric gain and bias at different viewpoints. Therefore, the warped pixels should be blended by different methods for three cases, visible pixel, occluded pixel, and disoccluded pixel, according to the hole maps. A visible pixel is labeled “non-hole” in all hole maps and can be seen from multiple viewpoints. Thus, its color can be computed by averaging the warped pixels. An occluded pixel is labeled “non-hole” in only one hole map and can be seen from only one viewpoint. Thus, its color is taken from the only warped pixel. Finally, a disoccluded pixel is labeled “hole” in all hole maps and cannot be seen from any viewpoint. Thus, it should be handled in the next step. In addition, the hole regions can be dilated before blending to avoid the ghost artifact, as shown in
Figure II-14.
Figure II-14 Blending step in view synthesis (a) without hole dilation, (b) with hole dilation
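The three blending cases can be sketched for one scanline with two warped views. Gray-level pixels, equal-weight averaging, and the function name are assumptions of this illustration; hole dilation is not shown.

```python
def blend_rows(left_row, left_holes, right_row, right_holes):
    """Blend two warped rows according to their hole maps."""
    blended, holes = [], []
    for left, left_hole, right, right_hole in zip(
            left_row, left_holes, right_row, right_holes):
        if not left_hole and not right_hole:
            blended.append((left + right) / 2)  # visible pixel: average
            holes.append(False)
        elif not left_hole:
            blended.append(left)                # occluded pixel: only the left view
            holes.append(False)
        elif not right_hole:
            blended.append(right)               # occluded pixel: only the right view
            holes.append(False)
        else:
            blended.append(None)                # disoccluded pixel: left for hole filling
            holes.append(True)
    return blended, holes


left_row, left_holes = [100, 120, None, None], [False, False, True, True]
right_row, right_holes = [110, None, 90, None], [False, True, False, True]
blended, holes = blend_rows(left_row, left_holes, right_row, right_holes)
print(blended)
print(holes)
```

Only the last position remains a hole after blending and is passed on to the hole filling step.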
2.2.3 Hole Filling
With multiple-view inputs, most holes can be easily recovered from the other views. The remaining disoccluded holes can be filled by an advanced in-painting method [60]. On the other hand, with single-view input, the DIBR algorithm suffers from large occlusion holes. The occluded holes can be handled by the disparity smoothing methods [52]-[55] to reduce the hole sizes, and be filled by the interpolation method [53].
In summary, the 3DTV applications demand a view synthesis engine to generate virtual view
videos, and the DIBR algorithm could satisfy this requirement through the above steps. However, the
quality of view synthesis is highly dependent on the performance of disparity estimation. Therefore, it
is necessary to develop a high-performance disparity estimation algorithm for the 3-D video
production.
2.3 Review of DERS Algorithm from 3DVC
The 3D Video Coding (3DVC) team is organized in the Moving Picture Experts Group (MPEG) to support the associated techniques for 3DTV applications. The associated techniques include disparity estimation, view synthesis, and multi-view video coding. The 3DVC team defines the configuration of input and output views for the 3DTV system, and delivers the reference software for disparity estimation [63] and view synthesis [64]. The algorithms in the reference software are called the DERS algorithm and the VSRS algorithm, respectively. The team also creates a test bed and a quality evaluation method to assess the performance of 3-D videos. Furthermore, they combine the disparity estimation and view synthesis with the multi-view video coding (MVC) [107] for data compression and transmission. In this section, we introduce the 3DVC’s DERS algorithm and point out its design challenges in the processing of high-resolution videos. In addition, we present the 3DVC’s I/O configuration and quality evaluation method, which are also adopted in this dissertation.
2.3.1 Input and Output View Configuration
The input and output settings are defined by the 3DVC [71], as shown in Figure II-15. In the 2-view configuration, the disparity estimation and view synthesis engines load the original left-view and right-view videos to generate the virtual-view video. Combining the synthesized video and one of the original videos provides the two views for the stereoscopic display. Figure II-15 (b) and (c) show the 3-view configuration. In the former, two view videos are synthesized for the stereoscopic display. For the 9-view display, eight virtual-view videos need to be synthesized and combined with the original center-view video. Based on the above configurations, the disparity estimation and view synthesis engines can be directly extended to support free-viewpoint TV if more view videos are available.
Figure II-15 Input and output view configuration defined by the 3DVC
(a) 2-view configuration for stereoscopic display, (b) 3-view configuration for stereoscopic display, (c) 3-view configuration for 9-view display
2.3.2 DERS Algorithm
The depth estimation reference software (DERS) algorithm [63] delivered by the 3DVC is
illustrated in Figure II-16. The DERS algorithm uses the three view image frames to compute the
center-view disparity map. In addition, the previous image frame and disparity map are also involved
for the temporal consistency enhancement. Note that the DERS algorithm can support the input videos
without rectification. The steps in the DERS algorithm are introduced in the following.
[Figure II-15 legend: OL: original left-view; OR: original right-view; OC: original center-view; SR: synthesized right-view; DE and VS: disparity estimation and view synthesis]