Analysis and Design of Real-Time Local Stereo Matching


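National Chiao Tung University, Department of Electronics Engineering & Institute of Electronics, Master's Program. Master's Thesis. Analysis and Design of Real-Time Local Stereo Matching. Student: Tsung-Hsien Tsai. Advisor: Tian-Sheuan Chang. September 2008.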


Analysis and Design of Real-Time Local Stereo Matching. Student: Tsung-Hsien Tsai. Advisor: Dr. Tian-Sheuan Chang. A Thesis Submitted to Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering. September 2008, Hsinchu, Taiwan, Republic of China.


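Analysis and Design of Real-Time Local Stereo Matching

Student: Tsung-Hsien Tsai. Advisor: Dr. Tian-Sheuan Chang.

Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract (translated from the Chinese)

Stereo vision is widely used in many fields, such as autonomous mobile robots, auto-tracking cameras, and even 3D television. Because many of these applications require a real-time stereo vision system, an integrated circuit that can meet the high computation and high bandwidth demands is needed. This study proposes an algorithm suited to hardware design, based on adaptive weight generation combined with mini-census matching, two-pass aggregation, and the quantized exponential Manhattan distance (Quantized Manhattan Color Distance). The mini-census reduces the computation from a whole window to only six points. In addition, it strengthens the original algorithm against problems caused by lighting. The two-pass aggregation and the quantized exponential Manhattan distance reduce the computational complexity by 88.7% and 64.2%, respectively. Unlike the original weight generation function, the quantized exponential Manhattan distance can be implemented as a table-lookup circuit. Finally, in the UMC 90 nm process, the proposed design performs depth estimation with 64 disparity levels on CIF-size frames at 43 frames per second under a 100 MHz clock. The chip requires 562,642 logic gates in total and 21.3K of on-chip memory.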


Analysis and Design of Real-Time Local Stereo Matching

Student: Tsung-Hsien Tsai. Advisor: Dr. Tian-Sheuan Chang.

Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

Abstract

Stereo matching has been widely used in many fields, such as autonomous robots, auto-tracking systems, and even 3D-TV. Because these applications demand real-time operation, a VLSI implementation becomes necessary to meet the high complexity and high bandwidth requirements of stereo matching algorithms. In this thesis, we propose a hardware-friendly algorithm based on adaptive support weight (ADSW), with mini-census, two-pass aggregation, and quantized exponential Manhattan distance techniques. The mini-census reduces the computational complexity from a matching block to only 6 points. It also improves the ability of ADSW to deal with the radiometric problem. The two-pass aggregation and the quantized Manhattan color distance reduce the computation of the cost aggregation by about 88.7% and 64.2%, respectively. Compared with the original weight generation function, the quantized Manhattan color distance can be easily implemented by a table-based circuit. The final design, implemented in UMC 90 nm CMOS technology, achieves 43 frames per second with 64 disparities at CIF image size under a 100 MHz clock rate. The chip uses 562,642 gates in total and 21.3K bytes of internal memory.


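Acknowledgments

First, I would like to thank my advisor, Dr. Tian-Sheuan Chang, for his support and encouragement over the past two years. He gave me the freedom to develop my own ideas in research, and whenever I ran into difficulties he offered appropriate guidance and sufficient resources to solve the problem. He is not only a good mentor in research but also a helpful friend in life, who understands his students' thinking and helps them with all kinds of problems in daily life.

I would also like to thank my oral defense committee members, Professor 王聖智 of the NCTU Department of Electronics Engineering and Professor 陳永昌 of the National Tsing Hua University Department of Electrical Engineering, for taking time out of their busy schedules to advise me; their valuable comments made this thesis more complete. Thanks to my good partners in the VSP Lab, especially my senior 張彥中, who introduced me to the field, led me from scratch, approached problems step by step with a rigorous attitude, gave me much pertinent and useful advice, and helped me with the writing of this thesis. Thanks to my seniors 張彥中 and 林佑昆: the experience and knowledge you passed on to me have benefited me endlessly. Thanks to my seniors 李得瑋, 郭子筠, 林嘉俊, and 吳秈璟 for sharing their IC design experience and research advice, and to my senior 廖英澤 for passing on his experience in purchasing matters. Thanks to my classmate 曾宇晟: from the undergraduate project to the IC contest and the teaching assistant work, there were many things I could not have done well alone without you. Thank you very much! Thanks to my classmates 詹景竹, 戴瑋呈, and 張瑋城: each of us has our own character, and working hard and joking around together in the lab every day is an unforgettable memory. Thanks to my juniors 許博雄 and 陳奕均; without your help I could not have graduated on time! Nor can I forget the lab juniors 黃筱珊, 陳之悠, 沈孟維, 許博淵, 蔡政君, and 廖元歆; the days spent with you were truly happy. Thanks to my girlfriend for your constant support and encouragement, which also gave my future path of learning a whole new direction. Thanks also to my friends on the table tennis team; practicing with you is my most fulfilling memory. Finally, I thank my family, who have quietly supported me: my parents and my sister, your warmth is the greatest pillar of my efforts. I dedicate this thesis to all who love me and all whom I love.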


TABLE OF CONTENTS

1. Introduction
  1.1. Background
  1.2. Motivation and Contribution
  1.3. Organization of the Thesis
2. Introduction of Computational Stereo
  2.1. Overview
  2.2. Epipolar Geometry
  2.3. The General Flow of Matching Algorithms
    2.3.1. Matching Cost Computation
    2.3.2. Cost Aggregation
    2.3.3. Disparity Computation
  2.4. A Taxonomy Evaluation
3. Related Work
  3.1. Overview
  3.2. Local Approach
  3.3. Global Approach
  3.4. Adaptive Support Weight
  3.5. Real-time Implementations
    3.5.1. General Purpose Processor
    3.5.2. Graphic Processing Unit
    3.5.3. Digital Signal Processing Processor
    3.5.4. Application-Specific Integrated Circuit
  3.6. Summary
4. Proposed Mini-Census Adaptive Support Weight
  4.1. Introduction
  4.2. The Flow of the Proposed Algorithm
  4.3. Mini-Census
  4.4. Weight Generation and Approximation
    4.4.1. The Performance with Different Color Space
    4.4.2. The Color Distance
    4.4.3. The Effect of Proximity Weight
    4.4.4. Quantized Exponential Function
    4.4.5. The Final Weight Table
  4.5. Aggregation Iteration
  4.6. Two-Pass Cost Aggregation Approximation
  4.7. Overall Simulation Result
5. Data Reuse Analysis of Hardware Implementation
  5.1. Overview
  5.2. Architecture Overview
  5.3. Matching Cost Computation Reuse
    5.3.1. Disparity-Order Reuse
    5.3.2. Pixel-Order Reuse
  5.4. Cost Aggregation Data Reuse
    5.4.1. Partial Column Reuse (PCR)
    5.4.2. Vertically Expanded Row Reuse (VERR)
  5.5. Comparison
  5.6. Summary
6. Hardware Implementation
  6.1. Overview
  6.2. Functional Block
    6.2.1. Mini-Census Transform
    6.2.2. Weight Generation
    6.2.3. Aggregation and Winner-Takes-All
    6.2.4. Input and Output Control
  6.3. Handshaking
  6.4. Arbitration
  6.5. Memory
    6.5.1. Memory Update Mechanism
    6.5.2. Memory Size
  6.6. Implementation Result
    6.6.1. External Bandwidth
    6.6.2. Area and Gate Counts
  6.7. Performance Result
Conclusion
Future Work
Reference

LIST OF FIGURES

Fig. 2-1 The epipolar geometry of the binocular stereo
Fig. 2-2 Correspondence matching finds all the matching penalties over a disparity range
Fig. 4-1 The flow of the proposed algorithm
Fig. 4-2 The census transform and matching
Fig. 4-3 The performance comparison with different color space
Fig. 4-4 The performance analysis of proximity weighting
Fig. 4-5 The weight from quantized exponential function
Fig. 4-6 The performance with quantized exponential function
Fig. 4-7 The error rate with the aggregation iteration and window size
Fig. 4-8 The minimum iteration with different size of support window
Fig. 5-1 The overview of hardware architecture
Fig. 5-2 The two data reuse directions with different size of support window
Fig. 5-3 The partial column reuse (PCR) in 5x5 aggregation window
Fig. 5-4 Vertically expanded row reuse (VERR)
Fig. 5-5 The average access count versus the number of expanded pixel
Fig. 6-1 The overview of the hardware design
Fig. 6-2 The module of census transform for left and right image
Fig. 6-3 The module of weight generation of vertical and horizontal weights
Fig. 6-4 The module of cost aggregation and its processing element
Fig. 6-5 The ping-pong buffer of cost aggregation module
Fig. 6-6 The finite-state-machine of the input and output control
Fig. 6-7 The handshaking mechanism between different modules
Fig. 6-8 The hybrid of round-robin and fixed priority arbitration strategy
Fig. 6-9 The column based cyclic buffer update mechanism
Fig. 6-10 The memory size of different module
Fig. 6-11 The performance with the bus access latency
Fig. 6-12 The percentage of the memory area and combinational gate counts
Fig. 6-13 The implementation result with different method

LIST OF TABLES

TABLE 2-1 Match metrics for correspondence matching [3]
TABLE 2-2 The test sequences of the taxonomy evaluation
TABLE 4-1 The result of approximated color distance
TABLE 4-2 The weight table of preserving 2 MSB bits
TABLE 4-3 The weight table of preserving 1 MSB bit
TABLE 4-4 The effect of different techniques
TABLE 5-1 The result of approximated color distance
TABLE 6-1 The implementation result of area and gate counts
TABLE 6-2 The error rate comparison of different method
TABLE 6-3 The performance comparison of different method

1. Introduction

1.1. Background

Stereo vision is one of the most popular topics in computer vision and still attracts the attention of many researchers. It is the process of recovering depth or distance information from a pair of images of the same scene. It can be used in many applications, such as 3D video conferencing [1], Z-keying, and virtual reality [2]. If the 3D depth map can be obtained at high speed, the real and virtual worlds can be merged in real time.

Stereo algorithms can be categorized into local and global approaches [3]. The local approach focuses on finding the similarities between reference and target windows by block matching or feature matching. The global approach uses global constraints to optimize the result. Because local approaches have low complexity, they are often adopted in real-time implementations. However, they often produce incorrect results in regions with occlusion, uniform texture, or ambiguity. The global approach can solve these problems but suffers from long processing time. Although some real-time global methods can be implemented on the GPU of a graphics card or with the MMX instructions of a CPU, such implementations are still expensive for embedded applications, since GPU and MMX are not dedicated hardware for stereo algorithms.

1.2. Motivation and Contribution

Motivated by the need for highly accurate, low-cost real-time stereo systems, this thesis proposes a hardware-friendly algorithm based on a state-of-the-art local approach. The goal is to build dedicated hardware for a low-cost real-time depth estimator with high accuracy. The major contributions of this thesis are:

1. We modify the adaptive support weight algorithm to make it more hardware friendly. The modified algorithm has much lower complexity and deals better with the radiometric problem.
2. We analyze the pixel-order and disparity-order data reuse strategies together with the vertically expanded row and partial column reuse methods.
3. We implement and verify real-time hardware for the proposed algorithm (Mini-Census Adaptive Support Weight).

1.3. Organization of the Thesis

Chapter 2 briefly introduces the background of computational stereo. Chapter 3 briefly introduces the stereo algorithms and real-time implementations. Chapter 4 discusses the details of the proposed algorithm with the mini-census, two-pass aggregation, and quantized exponential Manhattan distance, and also presents the simulation results. Chapter 5 analyzes the data reuse problem of hardware designs that implement aggregation-based algorithms. Chapter 6 shows the details of the hardware design and the implementation results. Finally, the conclusion is given after Chapter 6.

2. Introduction of Computational Stereo

2.1. Overview

The concept of computational stereo is to reconstruct structure in three-dimensional space from different viewpoints. The fundamental basis is to evaluate the depth of an object by finding the corresponding points that the object projects onto the two images of a stereo pair. The corresponding points are the feature points visible from both viewpoints. The process of finding the correspondences is referred to as correspondence matching. The disparity map for structure reconstruction can be computed after the correspondence matching.

2.2. Epipolar Geometry

Fig. 2-1 The epipolar geometry of the binocular stereo.

Fig. 2-2 Correspondence matching finds all the matching penalties over a disparity range. (The target pixel is at (x, y); the candidate pixels at (x-d, y) span the disparity range.)

Fig. 2-1 shows the binocular stereo calibrated with epipolar geometry. O_L and O_R are the two optical centers, and the distance between them is called the baseline. The object P is projected onto two points (p and p'). The depth Z of the object P can be computed by triangulation. As a result, the depth can be written as Z = fB/d, where f is the focal length of the camera, B is the length of the baseline, and d is the displacement between the two points, d = x - x' (depicted in Fig. 2-1). All the parameters except the displacement can be obtained during the setup of the system. Therefore, the goal of computational binocular stereo is to estimate the displacement between each pair of corresponding pixels in the target and candidate images (depicted in Fig. 2-2). The displacement is referred to as disparity, and the process is referred to as disparity estimation. The set of disparities of all the pixels in an image is called the disparity map or disparity image.

2.3. The General Flow of Matching Algorithms

According to Scharstein and Szeliski [4], stereo algorithms consist of three major steps: matching cost computation, cost (support) aggregation, and disparity computation/optimization. The matching cost in the first step represents the dissimilarity of different matching candidates. The cost aggregation sums up the dissimilarities of neighboring pixels, which is like exchanging information among neighboring pixels. The last step computes the final disparity map from the matching cost. The details are discussed in the following sections.

2.3.1. Matching Cost Computation

The disparity map can be computed by evaluating the matching cost for every disparity candidate. The matching cost represents the matching penalty after the correspondence matching. The range of the disparity candidates is called the disparity range. The correspondence matching is based on finding the correspondence between the support regions of the reference and candidate pixels. The support region is usually a square window, which is called the support window. The match metrics are listed in TABLE 2-1; the details can be found in [3]. The general formula of matching cost computation can be written as

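$$\mathrm{Cost}(x, y, d) = f_{\mathrm{cost}}\bigl(I_{\mathrm{ref}}(x, y),\ I_{\mathrm{tar}}(x - d, y)\bigr) \tag{2.1}$$

where $I_{\mathrm{ref}}$ and $I_{\mathrm{tar}}$ represent the reference and target images, and $f_{\mathrm{cost}}$ stands for the chosen match metric. The matching result forms a volume of matching cost in 3D space. The absolute difference (AD) is the most commonly used metric in stereo algorithms due to its simplicity. However, AD gives poor quality when the test images have global radiometric changes. The experiment in [5] shows that rank and mutual information perform better than AD under global radiometric changes and noise, because these match metrics compare local characteristics rather than the absolute difference of luminance.

TABLE 2-1 Match metrics for correspondence matching [3]

In the definitions, the sums run over the pixels $(u, v)$ of the support window, $\bar{I}$ denotes the mean intensity over the window, and $I'$ denotes a transformed image.

MATCH METRIC: DEFINITION

Normalized Cross-Correlation: $\dfrac{\sum_{u,v} \bigl(I_{\mathrm{ref}}(u,v) - \bar{I}_{\mathrm{ref}}\bigr)\bigl(I_{\mathrm{tar}}(u-d,v) - \bar{I}_{\mathrm{tar}}\bigr)}{\sqrt{\sum_{u,v} \bigl(I_{\mathrm{ref}}(u,v) - \bar{I}_{\mathrm{ref}}\bigr)^2 \cdot \sum_{u,v} \bigl(I_{\mathrm{tar}}(u-d,v) - \bar{I}_{\mathrm{tar}}\bigr)^2}}$

Sum of Squared Difference: $\sum_{u,v} \bigl(I_{\mathrm{ref}}(u,v) - I_{\mathrm{tar}}(u-d,v)\bigr)^2$

Normalized Sum of Squared Difference: $\sum_{u,v} \left( \dfrac{I_{\mathrm{ref}}(u,v) - \bar{I}_{\mathrm{ref}}}{\sqrt{\sum_{u,v} \bigl(I_{\mathrm{ref}}(u,v) - \bar{I}_{\mathrm{ref}}\bigr)^2}} - \dfrac{I_{\mathrm{tar}}(u-d,v) - \bar{I}_{\mathrm{tar}}}{\sqrt{\sum_{u,v} \bigl(I_{\mathrm{tar}}(u-d,v) - \bar{I}_{\mathrm{tar}}\bigr)^2}} \right)^2$

Sum of Absolute Difference: $\sum_{u,v} \lvert I_{\mathrm{ref}}(u,v) - I_{\mathrm{tar}}(u-d,v) \rvert$

Rank: $\sum_{u,v} \lvert I'_{\mathrm{ref}}(u,v) - I'_{\mathrm{tar}}(u-d,v) \rvert$, with $I'_k(u,v) = \sum_{m,n} \bigl[ I_k(m,n) < I_k(u,v) \bigr]$

Census [6]: $\sum_{u,v} \mathrm{HAMMING}\bigl(I'_{\mathrm{ref}}(u,v),\ I'_{\mathrm{tar}}(u-d,v)\bigr)$, with $I'_k(u,v) = \mathrm{BITSTRING}_{m,n}\bigl( I_k(m,n) < I_k(u,v) \bigr)$

Mutual Information [7]: $\sum_{a,b} P_{\mathrm{ref},\mathrm{tar}}(a,b) \log \dfrac{P_{\mathrm{ref},\mathrm{tar}}(a,b)}{P_{\mathrm{ref}}(a)\, P_{\mathrm{tar}}(b)}$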
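The contrast between AD and a census-style metric under a global radiometric change can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy images, the 3x3 census window, the brightness offset of 40, and the function names `ad_cost`, `census`, and `census_cost` do not come from the thesis.

```python
def ad_cost(ref, tar, x, y, d):
    """Absolute difference between ref(x, y) and the candidate tar(x - d, y)."""
    return abs(ref[y][x] - tar[y][x - d])


def census(img, x, y, radius=1):
    """Census signature: one bit per neighbor, 1 if the neighbor is darker than the center."""
    center = img[y][x]
    return tuple(
        1 if img[y + dy][x + dx] < center else 0
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
        if (dx, dy) != (0, 0)
    )


def census_cost(ref, tar, x, y, d, radius=1):
    """Hamming distance between the census signatures of ref(x, y) and tar(x - d, y)."""
    a = census(ref, x, y, radius)
    b = census(tar, x - d, y, radius)
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))


# Toy data (hypothetical): the target is the reference shifted by a true
# disparity of 1 pixel and brightened by a global offset of 40 gray levels.
ref = [
    [10, 20, 30, 40, 50, 60],
    [15, 90, 35, 45, 55, 65],
    [12, 22, 80, 42, 52, 62],
]
OFFSET = 40
tar = [[row[min(x + 1, len(row) - 1)] + OFFSET for x in range(len(row))] for row in ref]

x, y = 3, 1
ad = [ad_cost(ref, tar, x, y, d) for d in range(3)]
cen = [census_cost(ref, tar, x, y, d) for d in range(3)]
print(ad)   # [50, 40, 30]: the brightness offset hides the true disparity
print(cen)  # [1, 0, 3]: the census cost is minimal at the true disparity d = 1
```

In this toy case the AD cost is smallest at the wrong candidate, while the census cost depends only on the local intensity ordering and is therefore unaffected by the offset.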

2.3.2. Cost Aggregation

Cost aggregation aggregates the costs of correlated pixels over a support window. The underlying idea is that neighboring pixels may be highly correlated with the center pixel. The formula of cost aggregation is written as follows:

$$\mathrm{Cost}_{\mathrm{agg}}(x, y, d) = \sum_{(i, j) \in W(x, y)} \omega(x, y, i, j) \cdot \mathrm{Cost}_{\mathrm{init}}(i, j, d) \tag{2.2}$$

where $\mathrm{Cost}_{\mathrm{init}}$ is the initial matching cost from the match metrics, $W(x, y)$ is the support window around $(x, y)$, and $\omega$ is the weight of each cost. The weight limits the influence of unrelated pixels. Cost aggregation improves the quality in low-texture areas, which lack information on their own. However, it also blurs object edges when the costs of different objects are aggregated together. Therefore, the determination of the weights is of vital importance for cost aggregation.

2.3.3. Disparity Computation

The disparity map can be computed from the matching cost or the aggregated cost. The simplest way is to select the disparity candidate with the minimal cost; this process is called winner-takes-all (WTA). The formula of WTA can be expressed as

$$d_m(x, y) = \arg\min_{d} \mathrm{Cost}(x, y, d), \quad d \in [0, d_{\max}] \tag{2.3}$$

where $d_m$ is the disparity with the minimal cost over the disparity range. More robust methods with complex disparity optimization will be discussed in 3.2.

2.4. A Taxonomy Evaluation

For computational stereo algorithms, ambiguous matches lead to poor results. The ambiguous points include occlusions, low-texture (non-feature) regions, and repetitive patterns. Hence, a taxonomy evaluation [4] is proposed. The evaluation covers three parts: the non-occluded area, the total area, and the discontinuous area. The test sequences are shown in TABLE 2-2. The four sequences, Tsukuba, Venus, Teddy, and Cones, are the most commonly used for performance evaluation. The gray level of the ground truth represents the depth of the object: a pixel with a brighter gray level is closer to the camera or observer, and vice versa. In the non-occlusion images, the non-occluded and occluded regions are shown in white and black, respectively. In the discontinuities images, the regions near depth discontinuities are shown in white, occluded and unknown regions in black, and other regions in gray. The error for each of the three parts is evaluated only in the white regions.

TABLE 2-2 The test sequences of the taxonomy evaluation

[Image table. Columns: Tsukuba, Venus, Teddy, Cones. Rows: Input, Ground Truth, Nonocclusion, All, Discontinuities.]
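The masked evaluation described in Section 2.4 can be sketched as a bad-pixel percentage computed only where the mask is white. This is a minimal sketch under stated assumptions: the error threshold of 1 disparity level, the white value 255, the function name `bad_pixel_rate`, and the toy arrays are all illustrative and are not taken from the thesis.

```python
WHITE = 255  # assumed gray level that marks an evaluated pixel in a mask image


def bad_pixel_rate(disparity, ground_truth, mask, threshold=1.0):
    """Percentage of white-masked pixels whose disparity error exceeds the threshold."""
    evaluated = 0
    bad = 0
    for d_row, g_row, m_row in zip(disparity, ground_truth, mask):
        for d, g, m in zip(d_row, g_row, m_row):
            if m != WHITE:
                continue  # black or gray pixels are excluded from this error measure
            evaluated += 1
            if abs(d - g) > threshold:
                bad += 1
    return 100.0 * bad / evaluated if evaluated else 0.0


# Hypothetical 2x3 example: one occluded pixel (mask value 0) carries a large error.
estimate = [[5, 5, 7], [3, 9, 4]]
truth = [[5, 6, 5], [3, 3, 4]]
nonocc_mask = [[255, 255, 255], [255, 0, 255]]
all_mask = [[255, 255, 255], [255, 255, 255]]

print(bad_pixel_rate(estimate, truth, nonocc_mask))  # 20.0
print(bad_pixel_rate(estimate, truth, all_mask))
```

The same function covers all three parts of the evaluation; only the mask image changes.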

3. Related Work

3.1. Overview

The methods of disparity estimation can be roughly categorized into two types: local and global approaches. A local approach determines the disparity of a pixel based on the similarity of a support window. These methods can iteratively aggregate or regularly diffuse the matching cost over the support window. The local methods have low computational complexity and storage requirements, and they are often adopted by real-time implementations [8]. Global methods define objective energy functions, which usually include a data term and a neighboring term. The data term is often a transformed version of the matching cost. The neighboring term is a smoothness penalty that enforces disparity smoothness. Sometimes the neighboring term also includes an occlusion penalty and a segment constraint to improve the disparity estimation result. This is the major difference that sets global methods apart from local methods.

3.2. Local Approach

Among the local methods, the matching cost (dissimilarity measure) is often the block sum of absolute differences, normalized cross-correlation, census transform, or mutual information. Local methods often suffer from incorrect disparity estimation in occluded, low-texture, and repeating-pattern regions. Although a larger support window and more aggregation iterations improve the stereo matching performance in low-texture and repeating-pattern regions, they harm the performance in occluded regions. Because of this trade-off between large and small support windows, the reliable variable window size [9-11] was proposed. The window size depends on the reliability measurement of the current window size. The adaptive window size enhances the depth in low-texture areas, but the issue of occluded and border areas remains. To enhance the performance in occluded and border areas, the shiftable window approach is adopted [12][13], and the combination of adaptive size and shiftable window is discussed in [14]. However, the qualitative result in [14] shows that it is still difficult to obtain good results in both low-texture and border areas. To solve this issue, the concept of adaptive support weight (ADSW) aggregation was proposed by Yoon [15]. This approach adaptively changes the weights in a support window according to the color and spatial distance between the center and neighboring pixels. Consequently, adaptive support weight can achieve the effect of using a window of arbitrary size and shape. Once all the weighted sums of costs are computed, they are iteratively recomputed to produce a smoothed dense disparity map. Later, segmentation support aggregation was proposed [16][17]. The outlier rejection method [16] claimed to have both a very short computation time and good stereo matching performance. Recently, a report [18] showed that adaptive weight [15] and segment support [17] outperform the other aggregation-based methods [18-28]. Although adaptive support weight is the state of the art among local methods, its complexity is much higher than that of the segment-based method [18].

3.3. Global Approach

Global methods assume that the disparity map with minimum objective energy should be very similar to the ground truth. Therefore, global methods focus on optimizing the energy function to determine the disparity map. One of the earlier global methods is dynamic programming [29]. This method optimizes the energy associated with each scanline during disparity estimation. Although dynamic programming takes the horizontal global information into the optimization, the vertical correlation between scanlines is not considered. As a result, the disparity map of dynamic programming often exhibits horizontal streaks, which reduce its quality. Motivated by the need for 2-D optimization during disparity estimation, Roy and Cox [30] proposed to model the disparity-image space as a 3-D grid graph. By finding the min-cut on this graph, the disparity map with optimum energy is found; this optimization algorithm is also known as graph-cut. Unfortunately, the computation and storage requirements for running graph-cut on a 3-D grid graph are enormous. Later, Boykov and Kolmogorov proposed the iterative swap and expansion moves [31][32], which also use graph-cut to find the best moves. Unlike Roy and Cox's method, the swap and expansion moves use a simpler two-variable graph structure, which can be regarded as a 2-D graph. This simpler graph reduces the computational load of graph-cut. However, the extra iterations of moves offset the benefit.

On the other hand, Scharstein and Szeliski [4] proposed the Bayesian diffusion method, which iteratively diffuses support at different disparities according to a nonlinear diffusion strength. This is similar to using different weightings within the support window. Later, Sun [15] proposed belief propagation for disparity estimation based on the concept of Bayesian diffusion. Essentially, belief propagation is similar to Bayesian diffusion: both methods propagate information between neighboring pixels based on a probability model. However, belief propagation links the global energy function with information passing, which is absent in Bayesian diffusion. In addition, belief propagation uses a more complex updating mechanism, which is used to optimize the final energy. As a result, belief propagation has been reported [4][14] to produce disparity maps of much better quality than Bayesian diffusion.
Currently, the state-of-the-art methods produce their disparity maps by combining adaptive support weight,

segment constraint, and belief propagation together. Although belief-propagation-based methods lead in stereo matching performance, they also suffer from high computational complexity.

3.4. Adaptive Support Weight
Adaptive support weight (ADSW), proposed by Yoon [15], is the state of the art among local approaches. It aggregates the cost with weights that are generated adaptively from the color and spatial distances. The concept of ADSW is that the correlation of neighboring pixels is related to their spatial distance, which gives the proximity weight, and to their color distance, which gives the color weight. The weight in the cost aggregation formula (2.2) can be represented as

$$ w(p,q) = f_c(\Delta c_{pq}) \cdot f_s(\Delta s_{pq}) \tag{3.1} $$

where $\Delta c_{pq}$ and $\Delta s_{pq}$ represent the color distance and spatial distance between pixels p and q respectively, and $w(p,q)$ represents the strength of aggregating the cost. The color distance of two pixels is measured in the CIELab color space because it is more perceptually uniform. As the distance between two points in the color space increases, it is reasonable to assume that their perceptual similarity decreases. In particular, the Euclidean distance correlates strongly with human color discrimination performance. Therefore, the perceptual difference between two colors is represented as

$$ \Delta c_{pq} = \sqrt{(L_p - L_q)^2 + (a_p - a_q)^2 + (b_p - b_q)^2} \tag{3.2} $$

The strength of aggregating by color similarity is defined as

$$ f_c(\Delta c_{pq}) = \exp\left(-\frac{\Delta c_{pq}}{\gamma_c}\right) \tag{3.3} $$

In the same way, the strength of aggregating by proximity is defined as

$$ f_s(\Delta s_{pq}) = \exp\left(-\frac{\Delta s_{pq}}{\gamma_s}\right) \tag{3.4} $$

According to (3.3) and (3.4), the final weight for aggregation can be rewritten as

$$ w(p,q) = \exp\left(-\left(\frac{\Delta c_{pq}}{\gamma_c} + \frac{\Delta s_{pq}}{\gamma_s}\right)\right) \tag{3.5} $$

The final weight is the combination of the color weight and the proximity weight. Hence the cost aggregation can be rewritten as

$$ C(p,\bar{p}_d) = \frac{\sum_{q \in N_p,\ \bar{q}_d \in N_{\bar{p}_d}} w(p,q)\, w(\bar{p}_d,\bar{q}_d)\, e(q,\bar{q}_d)}{\sum_{q \in N_p,\ \bar{q}_d \in N_{\bar{p}_d}} w(p,q)\, w(\bar{p}_d,\bar{q}_d)} \tag{3.6} $$

where p and q are pixels in the reference image, $\bar{p}_d$ and $\bar{q}_d$ are the corresponding pixels in the target image with disparity value d, and $N_p$ and $N_{\bar{p}_d}$ are the support windows around p and $\bar{p}_d$. $e(q,\bar{q}_d)$ represents the matching cost computed from the pixels q and $\bar{q}_d$. When the truncated AD (absolute difference) is used, it can be expressed as

$$ e(q,\bar{q}_d) = \min\left\{ \left| I_{ref}(q) - I_{tar}(\bar{q}_d) \right|,\ T \right\} \tag{3.7} $$

where $I_{ref}$ and $I_{tar}$ are the reference image and target image respectively, and T is the truncation value. The adaptive support weight gives good results in both low-texture and border areas; the occluded areas can be refined by a left-right consistency check.

3.5. Real-time Implementations
Real-time stereo is an essential part of automobiles, robots, and other tracking systems. The issues in implementing real-time systems are the computing complexity, memory size, and bandwidth. Currently, the implementations can be categorized into four types: general-purpose processor, graphic processing unit (GPU),

digital signal processor (DSP), and application-specific integrated circuit (ASIC).

3.5.1. General Purpose Processor
With a state-of-the-art processor, some local approaches can be implemented to compute the disparity image in real time. These implementations [33] cannot give high-quality results since they are often simple approaches. A more robust and fast implementation of an effective aggregation algorithm [34] achieves only 18.9 million disparities per second (MDS), which is still far from real-time computing. As for the global approaches, the complexity of graph-cut and belief propagation is much higher than that of local approaches. These methods often take several minutes to compute one disparity image. However, a recent implementation [35] shows that dynamic programming can compute a good disparity result in real time.

3.5.2. Graphic Processing Unit
Recently, configurable graphics hardware has provided another solution for parallel computing. The programmer can write CUDA (Compute Unified Device Architecture) code, developed by NVIDIA, to accelerate the software. Currently, GPU solutions provide extremely high bandwidth, from 6.4GB/sec to 128GB/sec, and up to 256 stream processors. (For details of using the GPU, refer to GPGPU http://www.gpgpu.org/). With the computing power of GPU and CPU, many algorithms that generate high-quality results [34] [36] [37] [38] can be implemented in real time. The programmable graphics hardware is suitable for different stereo algorithms.

3.5.3. Digital Signal Processor
Although real-time performance can be achieved with GPU and CPU, the cost is too high for embedded applications. For a low cost embedded system, the Digital

Signal Processor (DSP) would be more cost efficient. The DSP provides SIMD and VLIW instructions, which are very useful for the parallel computation in local stereo matching. Some real-time local approaches are implemented on DSPs [39][40]. However, the computing power of a DSP is limited, and this constrains the development of more accurate disparity estimation algorithms.

3.5.4. Application-Specific Integrated Circuit
Compared with the GPU, the application-specific integrated circuit (ASIC) offers much more flexibility in designing the processing elements for the algorithms. The matching and data path can be fully customized to achieve high utilization. A simple absolute-difference design with variable window size is implemented by Hariyama [41], which can achieve high utilization and low. However, the bandwidth and the internal memory size become bottlenecks in designing the hardware. It is a challenge to handle the intermediate results of algorithms that require many iterations. The bandwidth required to transfer the intermediate results is extremely high and cannot meet the real-time constraint. Besides, the chip area becomes large if the intermediate results are stored in internal memory. The trade-off between bandwidth and internal memory size therefore becomes an important issue. To solve this problem, the concept of a hierarchical approach is proposed. The hierarchical belief propagation (HBP) [42][43] reduces the number of aggregation iterations, which relaxes the problem of high external bandwidth. Nevertheless, the FPGA implementation of HBP still requires a huge amount of block RAM. Therefore, although an ASIC design can give a dedicated solution, it is still a challenge to design a low cost real-time architecture for algorithms that iteratively aggregate cost and optimize disparity.

3.6. Summary
Considering the real-time problem, the general purpose processor and the DSP have their limitations for more complex matching algorithms. GPU acceleration has high potential for implementing highly complex stereo algorithms since it offers extremely high bandwidth and a large number of streaming processors. Although the GPU solution may be implemented in an embedded system, it is still expensive. For a low cost embedded system, the DSP or ASIC may be a more suitable candidate. However, handling the intermediate results is a big challenge for the ASIC solution because of the limited external bandwidth. This results in a high internal memory cost for the ASIC solution.

4. Proposed Mini-Census Adaptive Support Weight

4.1. Introduction
In this chapter, we introduce the proposed algorithm, which is modified from the Adaptive Support Weight [15] introduced in 3.4. We simplify the algorithm to make it suitable for hardware design. Besides, we also improve its capability of dealing with lighting effects by applying the census transform [9]. There are three major challenges in designing hardware for real-time Adaptive Support Weight: the adaptive weight generating function, the iterative cost aggregation, and data reuse. We discuss how the first two challenges are solved in the proposed algorithm, and discuss the data reuse problem in Chap 5.

4.2. The Flow of the Proposed Algorithm

Fig. 4-1 The Flow of the Proposed Algorithm

Fig. 4-1 shows the flow of the proposed algorithm. The proposed algorithm consists of four major steps. First, the mini-census matching cost computation performs the mini-census transform on the captured left and right images and computes the initial matching cost of each pixel. The second step is the weight generation, which generates the weight coefficients needed in the cost aggregation step. Once the initial matching cost and weight coefficients are available, the matching cost is aggregated through

a two-pass cost aggregation step. Finally, after the cost aggregation, the disparity map is obtained by finding the disparity with the minimum matching cost through a Winner-Takes-All method.

4.3. Mini-Census
The census transform compares the intensity of each pixel within a support window with that of the center pixel. If a pixel's intensity is larger than the center pixel's intensity, it is given the label 0, otherwise the label 1. The comparison is done in raster-scan order. After all pixels within the support window have been compared, a binary bitstream is obtained which characterizes the relation between the center pixel and its surrounding pixels. Since the bitstream represents relative information, the census transform is much less sensitive to image bias and gain. In addition, the census transform preserves the depth boundaries in disparity maps better than the traditional SAD does.

Current block      Census transform    Bitstream 1
34   3  13         0  1  1
 5  15  23         1  X  0             01110100
 2  54  30         1  0  0

Candidate block    Census transform    Bitstream 2
 4  68  17         1  0  1
61  51   4         0  X  1             10101110
23   3  59         1  1  0

Hamming Distance = 5

Fig. 4-2 The census transform and matching

To compute the matching cost, the bitstream b1 of a pixel in the current view and the bitstream b2 of the candidate corresponding pixel in the other view are obtained first, and then the Hamming distance between the two bitstreams is computed and taken as the matching cost. The cost can be defined as

$$ Cost(x,y,d) = H(b_1, b_2) \tag{4.1} $$

where H is the Hamming distance function. We refer to the Hamming distance as the census cost hereafter for brevity. Fig. 4-2 illustrates an example of the census transform with a 3x3 support window. The bitstreams of the current pixel position and the candidate corresponding pixel are 01110100 and 10101110. The Hamming distance between the bitstreams is 5; hence, the census cost is 5. The Mini-Census is a simplified census transform. Instead of a whole block of pixels, only 6 significant pixels are transformed into the bitstream. The Mini-Census helps reduce the internal memory needed to store the matching cost, with only a minor loss in matching performance.

4.4. Weight Generation and Approximation
The adaptive weight generation is based on the color distance and the proximity. The proximity weight is fixed for a support window of constant size, but the color distance term changes as the support aggregation window changes position. In the original Adaptive Support Weight (ADSW), the color distance weight is generated in the CIE-Lab color space, which uses floating-point numbers to represent a color. However, floating-point numbers are not friendly to hardware design. Besides, the square-root and exponential functions used in the color distance computation and the color weight generation are not hardware friendly either. To make the algorithm more suitable for hardware implementation, we adopt an integer-valued color space, an approximated color distance, and an approximated exponential function. Moreover, we also remove the proximity weight to further reduce the computational complexity. The performance of these modifications is explained in the following subsections.
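The census matching of Section 4.3 can be sketched in Python with the two 3x3 blocks of Fig. 4-2. This is only an illustrative sketch: the function names `census_bits` and `hamming` are hypothetical, and the full 3x3 window is used because this section does not state which six pixels the mini-census selects.

```python
def census_bits(block):
    """Census transform of a square block, scanned in raster order.

    A neighbor brighter than the center pixel gets label 0, otherwise 1.
    The center pixel itself is skipped.
    """
    n = len(block)
    c = n // 2
    center = block[c][c]
    bits = []
    for row in range(n):
        for col in range(n):
            if row == c and col == c:
                continue
            bits.append('0' if block[row][col] > center else '1')
    return ''.join(bits)


def hamming(b1, b2):
    """Number of positions in which two equal-length bitstreams differ."""
    if len(b1) != len(b2):
        raise ValueError("bitstreams must have the same length")
    return sum(x != y for x, y in zip(b1, b2))


# Pixel values taken from Fig. 4-2.
current = [[34, 3, 13], [5, 15, 23], [2, 54, 30]]
candidate = [[4, 68, 17], [61, 51, 4], [23, 3, 59]]

b1 = census_bits(current)        # bitstream of the current block
b2 = census_bits(candidate)      # bitstream of the candidate block
census_cost = hamming(b1, b2)    # matching cost as in (4.1)
```

A mini-census variant would, under this sketch, only change which neighbor positions are visited, so the bitstream and the cost range become shorter.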

4.4.1. The Performance with Different Color Space
The color space has a great impact on the performance of many image processing algorithms. We evaluate the impact of different color spaces (YUV, RGB, and CIE-Lab) on the performance of stereo matching. The best parameters differ with the size of the support aggregation window. Hence, to eliminate the effect of the parameter choice, we simulate 100 parameter samples for each window size and take the best one.

Fig. 4-3 The performance comparison with different color space (four panels; curves for RGB, Y, YUV, and LAB)

From Fig. 4-3, the performance of the color spaces with three color components (YUV, RGB, CIE-Lab) is almost the same. The color space with only the luminance component has the worst performance since it lacks the other two dimensions of the color space. It can be seen that for three-component color spaces, the weight generated from using different color spaces does not have significant impact on the

stereo matching performance. Hence, we can choose any three-component color space that is suitable for the design. Since YUV and RGB can be represented using three unsigned integers instead of CIE-Lab's three floating-point numbers, YUV and RGB are more suitable for hardware design. We choose YUV in our algorithm because it has been reported to slightly outperform RGB in stereo matching.

4.4.2. The Color Distance
In ADSW, the color distance is defined as the Euclidean distance in the color space, which is written as follows

$$ \Delta c_{pq} = \sqrt{(Y_p - Y_q)^2 + (U_p - U_q)^2 + (V_p - V_q)^2} \tag{4.2} $$

The square root in the Euclidean distance is a nonlinear operator, which is difficult to implement in hardware. The Manhattan distance, on the other hand, is more hardware efficient. The formula is written as follows

$$ \Delta c_{pq} = |Y_p - Y_q| + |U_p - U_q| + |V_p - V_q| \tag{4.3} $$

TABLE 4-1 compares the performance of the Euclidean and Manhattan color distances. The result shows that the Manhattan distance is slightly better than the Euclidean distance for different error tolerances and different test sequences.

TABLE 4-1 the result of approximated color distance

                 Error                Error Rate %
Method           Tolerance   rank    TSUKUBA   VENUS   TEDDY   CONES
Euclidean        0           12.2    12.2      7.95    21.4    18.0
Manhattan        0           11.1    11.1      7.22    21.7    16.8
Euclidean        1           17.3    3.47      0.91    14.3    11.2
Manhattan        1           16.3    3.08      0.59    14.0    10.1
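The difference between the two color distances of Section 4.4.2 can be sketched as follows. The pixel values are hypothetical YUV triples chosen for illustration, and the function names are not from the thesis. The Manhattan distance needs only subtractions, absolute values, and additions, whereas the Euclidean distance needs multipliers and a square root.

```python
import math


def euclidean_distance(p, q):
    """Euclidean color distance between two three-component pixels."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Manhattan color distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical YUV values of a center pixel p and a neighbor q.
p = (120, 128, 130)
q = (117, 132, 130)

d_euclidean = euclidean_distance(p, q)   # sqrt(3^2 + 4^2 + 0^2)
d_manhattan = manhattan_distance(p, q)   # 3 + 4 + 0
```

For three components the Manhattan distance always lies between the Euclidean distance and sqrt(3) times the Euclidean distance, so the two measures order color differences similarly, which is in line with the small performance gap reported in the table.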

4.4.3. The Effect of Proximity Weight
Proximity weighting reduces the effect of pixels farther from the window center and has been applied to improve the matching performance. To determine whether proximity weighting is necessary, we compare the performance with and without it. Fig. 4-4 shows the error rate for different support window sizes. In Fig. 4-4(a), the error rate is high when the window size is too small. The error rate also increases as the window size grows beyond 27x27. However, after applying the proximity weighting, the error rate does not increase when the window size is enlarged. This is shown in Fig. 4-4(b). It is the proximity weight that limits the influence of the farther pixels.

Fig. 4-4 The Performance Analysis of Proximity Weighting (error rate versus support window size from 3 to 171; curves for nonocc, all, disc, rank, and rms)

4.4.4. Quantized Exponential Function
The quantized exponential function is a simplification of the original exponential weight generating function, and it also helps to reduce the complexity of the aggregation process. The quantized exponential function is a scaled and quantized version of the original function. The quantized exponential function can be represented as below.

$$ f_q(\Delta c_{pq}) = Q_m\left( 2^{n} \cdot \exp\left(-\frac{\Delta c_{pq}}{\gamma_c}\right) \right) \tag{4.4} $$

where $Q_m$ denotes the quantization that preserves only the m most significant bits. That is, the result of the quantized exponential function is obtained by first multiplying the value of the original exponential function by a scaling factor $2^n$, and then quantizing it to preserve only a few MSBs. The scaling maps the floating-point number to an integer, which is more hardware friendly. Preserving only a few bits helps to reduce the complexity of the cost aggregation. In the original cost aggregation step, the process is a sum of products of the weight vector and the cost vector. If the weight is coded with one-hot encoding, the product operator can be simplified to a shift operator, which is much more hardware efficient. Fig. 4-5 shows the weights from the original and the quantized exponential functions with different numbers of preserved bits. The output of the exponential function is multiplied by 64 and then quantized. Fig. 4-5c and Fig. 4-5d are the outputs of the quantized exponential function with 2 and 1 MSB preserved respectively.

Fig. 4-5 The weight from quantized exponential function (panels: Original; x64, Quantized; x64, Quantized, P 2-bits; x64, Quantized, P 1-bit)
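The scale-then-quantize step described above can be sketched as follows. The value gamma = 7 is an assumption that is not stated in this section; it is used here because, with a scaling factor of 64 and one preserved MSB, it gives the weights 64, 32, 16, and 8 at distances 0, 1, 5, and 10, as listed in TABLE 4-3. The helper names are hypothetical.

```python
import math


def keep_msbs(value, m):
    """Keep only the m most significant bits of a non-negative integer."""
    if value <= 0:
        return 0
    shift = max(value.bit_length() - m, 0)
    return (value >> shift) << shift


def quantized_weight(distance, gamma=7.0, n=6, m=1):
    """Scale exp(-distance/gamma) by 2**n, truncate, and keep m MSBs.

    gamma is an assumed parameter; n=6 corresponds to the scaling factor 64.
    """
    scaled = int((1 << n) * math.exp(-distance / gamma))
    return keep_msbs(scaled, m)


def weighted_cost(cost, weight):
    """Multiply a cost by a power-of-two weight using only a shift."""
    if weight == 0:
        return 0
    return cost << (weight.bit_length() - 1)


weights = [quantized_weight(d) for d in range(30)]
```

With one preserved MSB every nonzero weight is a power of two, so `weighted_cost(5, 16)` is the same as `5 << 4`. This is the reason the multiplier in the aggregation can be replaced by a shifter.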

Fig. 4-6 The performance with quantized exponential function ((a) average error rate versus scaling factor; (b) average error rate versus preserved MSB bits; curves for NONOCC, ALL, and DISC)

Fig. 4-6 shows the performance of the quantized exponential function with different scaling factors and numbers of preserved MSBs. Fig. 4-6a shows that the average error rate decreases as the scaling factor grows, as long as the scaling factor is smaller than 32. If the scaling factor is larger than 64, there is no conspicuous difference in the error rate. Hence, the smallest scaling factor with acceptable quality is 64. Fig. 4-6b shows that there is no conspicuous difference among the numbers of preserved bits. Therefore, we set the scaling factor to 64 and preserve only one MSB.

4.4.5. The Final Weight Table
After the discussion in the preceding subsections, the weight generating function can be simplified into a mapping table by using the YUV color space, discarding the proximity weight, and applying the quantized exponential function and the Manhattan distance. The tables are listed in TABLE 4-2 and TABLE 4-3. The difference between these two tables is the number of preserved MSBs of the quantized exponential function. According to Fig. 4-6b, TABLE 4-3 is more suitable for hardware design since all of its weights are powers of two. As a result, the weight generating Equations (4.4) and (4.2) become Equations (4.5) and (4.6).

$$ \omega(x,y,i,j) = \mathrm{Table}\big(\Delta c(x,y,i,j)\big) \tag{4.5} $$

$$ \Delta c(x,y,i,j) = |Y(x,y) - Y(x+i,y+j)| + |U(x,y) - U(x+i,y+j)| + |V(x,y) - V(x+i,y+j)| \tag{4.6} $$

TABLE 4-2 The weight table of preserving 2 MSB bits

Distance  Weight    Distance  Weight    Distance  Weight    Distance  Weight
0         64        8         20        16        6         24        2
1         55        9         17        17        5         25        1
2         48        10        12        18        4         26        1
3         40        11        12        19        4         27        1
4         36        12        10        20        3         28        1
5         24        13        10        21        3         29        1
6         24        14        8         22        2
7         20        15        6         23        2

TABLE 4-3 The weight table of preserving 1 MSB bit

Distance  Weight    Distance  Weight    Distance  Weight    Distance  Weight
0         64        8         16        16        4         24        2
1         32        9         16        17        4         25        1
2         32        10        8         18        4         26        1
3         32        11        8         19        4         27        1
4         32        12        8         20        2         28        1
5         16        13        8         21        2         29        1
6         16        14        8         22        2
7         16        15        4         23        2

4.5. Aggregation Iteration
The aggregation based method refines the depth result by iteratively aggregating the matching cost. The cost aggregation formula is defined as

$$ Cost_{t+1}(x,y,d) = \sum_{i} \sum_{j} Cost_{t}(x+i, y+j, d) \cdot \omega(x,y,i,j) \tag{4.7} $$

where $Cost_t$ and $Cost_{t+1}$ are the aggregated costs at iterations t and t+1, the sums run over the aggregation window, and r is the width and height of the aggregation window. The iterative aggregation poses a challenge for

real-time hardware design because of the inter-iteration dependence, which limits the parallelism, and because of the huge memory storage and wide bandwidth it requires. Hence, reducing the number of aggregation iterations is an important issue.

Fig. 4-7 The error rate with the aggregation iteration and window size (four panels; error rate bands plotted over the number of iterations and the window size)

The best number of cost aggregation iterations depends on the window size and the aggregation algorithm. Fig. 4-7 shows the error rate distribution over the plane of aggregation iterations and window size, based on the ADSW. From the figure, the iteration number with the lowest error rate is related to the support window size. Cost aggregation with a smaller window size requires more iterations to achieve a lower error rate. Conversely, aggregation with a larger window size requires fewer iterations. Moreover, the region with the lowest error rate exists only at larger window sizes. Hence, the performance with a larger window size is better than with a smaller one.

Fig. 4-8 the minimum iteration with different size of support window (panels: Non-Occluded, All, Discontinuities, RANK; horizontal axis: Support Window Size from 11 to 59)

Fig. 4-8 shows the minimum number of iterations needed to achieve the lowest error rate. The trend of each curve is also plotted in the figure. For all evaluation regions and for the rank, the minimum number of iterations decreases as the window size increases. Note that if the window size is larger than 39, only one aggregation iteration is required to achieve the lowest error rate. However, it is difficult for a hardware design to adopt such a large window size or more than one iteration. Hence, the design must trade some performance for feasibility. As a result, the adopted window size and number of aggregation iterations are 31 pixels and 1 respectively for this design. According to Fig. 4-7 and Fig. 4-8, the performance is acceptable.

4.6. Two-Pass Cost Aggregation Approximation
The window based cost aggregation sums up the costs over the support window with their related weights. This process requires high computational resources. Fortunately, the process of window based aggregation is separable [44]. The original formula is written as equation (4.7). The separated aggregation is written as equations (4.8) and (4.9). The first aggregation is processed in the vertical direction and the second aggregation in the horizontal direction. The separated cost aggregation reduces the computational complexity. For instance, if the window size is (r+1) * (r+1) and the disparity range is D, the original complexity is proportional to O(r^2 D), whereas the complexity of the separated aggregation is proportional to O(2rD). Besides, this approximation also helps reduce the internal bandwidth of the hardware design.

$$ T(x,y,d) = \sum_{j} Cost_{t}(x, y+j, d) \cdot \omega(x,y,0,j) \tag{4.8} $$

$$ Cost_{t+1}(x,y,d) = \sum_{i} T(x+i, y, d) \cdot \omega(x,y,i,0) \tag{4.9} $$

4.7. Overall Simulation Result

TABLE 4-4 the effect of different techniques

                                         Error Rate %                        Exec.
Method                            ET     TSUKUBA   VENUS   TEDDY   CONES     Time(sec)
Original                          0      1.85      1.19    13.3    9.79      95.65
+MC+2P                            0      3.47      0.91    14.3    11.2      4.75
+MC+2P+Manhattan                  0      3.08      0.59    14      10.1      3.12
+MC+2P+Manhattan+Truc(64,2)       0      3.03      0.61    14      10.1      2.52
+MC+2P+Manhattan+Truc(64,1)       0      3.06      0.66    13.9    10.1      1.84
Original                          1      18.8      8.40    23.9    19.7      95.65
+MC+2P                            1      12.2      7.95    21.4    18.0      4.75
+MC+2P+Manhattan                  1      11.1      7.22    21.7    16.8      3.12
+MC+2P+Manhattan+Truc(64,2)       1      11.0      7.22    21.6    16.8      2.52
+MC+2P+Manhattan+Truc(64,1)       1      11.2      7.17    21.4    16.7      1.84
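The two-pass aggregation of Section 4.6 can be sketched for one disparity plane as follows. This is a minimal sketch under stated assumptions: `half` is a hypothetical half-window size (the window is (2*half+1) pixels wide), `weight` is any callable standing in for ω(x, y, i, j), and neighbors outside the image are simply skipped, which the thesis does not specify.

```python
def aggregate_full(cost, weight, half):
    """Direct 2-D aggregation over the whole window, as in (4.7)."""
    height, width = len(cost), len(cost[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for j in range(-half, half + 1):
                for i in range(-half, half + 1):
                    if 0 <= y + j < height and 0 <= x + i < width:
                        total += cost[y + j][x + i] * weight(x, y, i, j)
            out[y][x] = total
    return out


def aggregate_two_pass(cost, weight, half):
    """Vertical pass as in (4.8), then horizontal pass as in (4.9)."""
    height, width = len(cost), len(cost[0])
    offsets = range(-half, half + 1)
    # Vertical pass: T(x, y) sums the column around (x, y).
    t = [[sum(cost[y + j][x] * weight(x, y, 0, j)
              for j in offsets if 0 <= y + j < height)
          for x in range(width)] for y in range(height)]
    # Horizontal pass: sum the partial column results along the row.
    return [[sum(t[y][x + i] * weight(x, y, i, 0)
                 for i in offsets if 0 <= x + i < width)
             for x in range(width)] for y in range(height)]


def uniform(x, y, i, j):
    """Uniform weight, used here only to check the separable case."""
    return 1


cost = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
full = aggregate_full(cost, uniform, 1)
two_pass = aggregate_two_pass(cost, uniform, 1)

# Multiply-accumulate operations per pixel and per disparity, 31x31 window.
macs_full = (2 * 15 + 1) ** 2
macs_two_pass = 2 * (2 * 15 + 1)
```

With uniform weights the two results are identical. With adaptive weights they generally differ, which is why the two-pass form is an approximation. For a 31x31 window the per-pixel work drops from 961 to 62 multiply-accumulate operations in this simple count.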

5. Data Reuse Analysis of Hardware Implementation

5.1. Overview
External memory bandwidth and internal memory size have been major bottlenecks in designing VLSI architectures for real-time stereo matching hardware because of the large amount of pixel data and the disparity range. To address these bottlenecks, this chapter explores the impact of disparity-order and pixel-order data reuse together with the partial column reuse (PCR) and vertically expanded row reuse (VERR) techniques we propose. The analysis suggests that disparity-order reuse with both the PCR and VERR techniques is suitable for designs with low memory cost and low external bandwidth, whereas pixel-order reuse with both techniques is more suitable for designs with low computation resource requirements. However, the implementation of disparity-order reuse requires high internal bandwidth. Hence, our final implementation adopts a hybrid of disparity-order and pixel-order reuse with the VERR technique.

5.2. Architecture Overview

Fig. 5-1 the overview of hardware architecture

When implementing an aggregation based method under a real-time constraint, there are many solutions to the data reuse issue. We use the hardware architecture shown in Fig. 5-1 to explain the different solutions. In the matching cost computation, if data reuse along the disparity axis is preferred, all the matching costs of a pixel are computed before moving to the next pixel. This allows the data within the matching cost support window to be reused. However, the cost aggregation sums the initial matching costs of the same disparity together, and therefore prefers the initial costs to be output along the spatial X-Y plane rather than along the disparity axis. As a result, to compute the aggregated cost within an aggregation window, all the matching costs at each disparity must be stored before the aggregation can be performed. These initial matching costs form a cuboid in the disparity-spatial D-X-Y space. The volume of this cuboid represents the memory size needed to store the initial costs. One way to reduce the storage requirement is to avoid the conflict in the data reuse direction. For instance, the reuse direction in the matching cost computation can be changed to the X-Y plane so that it matches the processing direction of the cost aggregation. Although doing so removes the conflict between the matching cost computation and the cost aggregation, the conflict between the cost aggregation and the disparity computation still exists. To determine the disparity of a pixel, the disparity computation needs all the aggregated matching costs at each disparity for that pixel. However, the aggregated costs are generated in the X-Y plane direction, which differs from the direction preferred by the disparity computation. Consequently, additional storage would be required to store the aggregated costs. These conflicts in the data generation and reuse directions play a key role in determining the storage requirement.
Therefore, it is important to derive the best data reuse strategy which resolves these conflicts so that the storage requirement can be minimized.

5.3. Matching Cost Computation Reuse
The data reuse in the matching cost computation can be categorized into two types according to the reuse order. The details of these data reuse methods are explained below.

5.3.1. Disparity-Order Reuse

Fig. 5-2 the two data reuse directions with different size of support window ((a) Matching Cost Generating in Disparity Direction; (b) Matching Cost Generating in XY Plane; each panel shows the data reuse region in the left and right images and the resulting matching costs in the D-X-Y space)

The disparity-order reuse reuses the data in the matching windows of different disparities. Fig. 5-2(a) illustrates how disparity-order reuse works. When we compute the disparity of a pixel in the left image, the matching window in the right image slides leftward within the disparity range. In other words, the matching costs of all disparities for a pixel in the left image are computed first. Then the matching cost computation of the next pixel in the left image is performed. With the disparity-order reuse, the overlapped data within the matching windows in the right image shown in Fig. 5-2(a) can be reused to compute the matching costs at different disparities. As a result, if

the pixel data are stored in external memory, there is no need for repeated accesses of the overlapped pixels. Hence, the bandwidth requirement to external memory can be reduced. However, the order of matching cost generation is different from the order of matching cost consumption in the following cost aggregation step. This results in an additional memory storage requirement.

5.3.2. Pixel-Order Reuse
In contrast to the disparity-order reuse, the pixel-order reuse reuses the data overlapped by neighboring matching windows in both the left and right images. Fig. 5-2(b) illustrates the detail of the pixel-order reuse. The matching cost of the same disparity is computed first for each pixel. Then the cost of the next disparity is computed for each pixel. As a result, the matching windows in the left and the right images slide synchronously with the same disparity offset. With the pixel-order reuse, the overlapped data within the matching windows shown in Fig. 5-2(b) can be reused. Therefore, the pixel-order reuse can also reduce the external memory bandwidth requirement. In contrast to the disparity-order reuse, the order of matching cost generation is the same as the order in which the costs are consumed by the following cost aggregation step. Hence, the buffer size between the two steps can be reduced. However, the data reuse can only be exploited during the cost computation of one single disparity. There is no data reuse between the computations of different disparities. Once the computation of the previous disparity has been completed for all the pixels in the whole image, the pixel data have to be read from the external memory again. Unless all the previously read pixel data can be stored within the internal memory, repeated external memory accesses are inevitable.
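The two generation orders of Section 5.3 can be sketched as two loop nests over a single scanline. This is a simplified one-dimensional model for illustration only; the function names and the small sizes are hypothetical. The helper counts how many matching costs have been generated by the time the aggregation has received enough costs of one disparity to fill a window.

```python
def disparity_order(width, max_disp):
    """All disparities of one pixel first, then the next pixel."""
    for x in range(width):
        for d in range(max_disp):
            yield (x, d)


def pixel_order(width, max_disp):
    """All pixels of one disparity first, then the next disparity."""
    for d in range(max_disp):
        for x in range(width):
            yield (x, d)


def costs_until_window_ready(order, window, disparity=0):
    """Costs generated until `window` costs of `disparity` are available."""
    seen = 0
    for count, (_, d) in enumerate(order, start=1):
        if d == disparity:
            seen += 1
            if seen == window:
                return count
    raise ValueError("window is larger than the scanline")


# Hypothetical sizes: 8 pixels per scanline, 4 disparities, 3-pixel window.
ready_disparity_order = costs_until_window_ready(disparity_order(8, 4), 3)
ready_pixel_order = costs_until_window_ready(pixel_order(8, 4), 3)
```

In this model the disparity-order loop has produced 9 costs before the first 3-pixel window of disparity 0 is complete, and 6 of them belong to other disparities and must be buffered. The pixel-order loop delivers the same window after 3 costs, which mirrors the buffer size argument above.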

5.4. Cost Aggregation Data Reuse
In addition to the data reuse in the matching cost computation, there are two data reuse methods in the cost aggregation. The details of these two data reuse methods are explained as follows.

5.4.1. Partial Column Reuse (PCR)
The partial column reuse method reduces the local memory size in the cost aggregation by distributing the computation of the aggregated cost over the columns. Instead of computing the aggregated cost after all the initial costs in an aggregation window are available, the PCR computes the partial sum of a column as soon as the initial costs of this column are available. As a result, the size of the local memory can be reduced from a window to only one column. Moreover, the partial sum of each column can contribute to the aggregated costs of multiple overlapped windows. Storing the partial column cost requires less local memory than storing all the initial matching costs in a column. Fig. 5-3 illustrates an example of the PCR with a 5x5 aggregation window. An aggregated cost requires the partial sums of five initial cost columns. With the PCR, the current partial column sum in Fig. 5-3 can be reused to contribute to the aggregated costs of windows 1 to 5.

Fig. 5-3 The partial column reuse (PCR) in 5x5 aggregation window (one partial column sum shared by the overlapping aggregation windows 1 to 5)
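The partial column reuse can be sketched for one row of aggregated costs as follows. To keep the example short, plain sums stand in for the weighted sums, and the function names and the 5x9 cost array are hypothetical. The sketch also records how many windows use each partial column sum.

```python
def column_sums(cost, top, height):
    """Partial sum of each column over `height` rows starting at `top`."""
    width = len(cost[0])
    return [sum(cost[top + j][x] for j in range(height)) for x in range(width)]


def aggregate_row_pcr(cost, top, size):
    """Aggregate one row of size x size windows from partial column sums.

    Returns the aggregated costs and, for each column, the number of
    windows that reused its partial sum.
    """
    cols = column_sums(cost, top, size)   # each column is summed only once
    uses = [0] * len(cols)
    out = []
    for left in range(len(cols) - size + 1):
        out.append(sum(cols[left:left + size]))
        for x in range(left, left + size):
            uses[x] += 1
    return out, uses


# Hypothetical 5x9 block of initial costs, all equal to 1.
cost = [[1] * 9 for _ in range(5)]
aggregated, uses = aggregate_row_pcr(cost, top=0, size=5)
```

In this example the middle column sum is used by five windows, matching the 5x5 case of Fig. 5-3, and only one value per column has to be kept instead of the five initial costs of that column.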

5.4.2. Vertically Expanded Row Reuse (VERR)
The vertically expanded row reuse reduces the bandwidth requirement of the cost aggregation engine by deliberately accessing additional rows of initial costs. Without the VERR, when the aggregation finishes processing the current row and moves to the next row, the data overlapped between the windows of the previous row and the current row have to be read from the cost computation engine again. Fig. 5-4 shows an example of the situation in which the data are overlapped. To avoid accessing costs that have already been accessed, the VERR vertically expands the rows of initial costs to be read so that they can be reused to compute multiple rows of aggregated cost.

Fig. 5-4 Vertically Expanded row reuse (VERR)

Fig. 5-4 shows how the VERR reduces redundant accesses of the overlapped data. Without the VERR, most of the data in the windows are overlapped many times. Consequently, these overlapped data are read repeatedly. In contrast, with the VERR, the portion of overlapped data is much smaller than in the case without the VERR. Moreover, the overlapped data in the VERR case overlap only once. This implies that with the VERR, the repeated accesses of the overlapped data are fewer than in the case without the VERR.
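The bandwidth effect of the VERR can be estimated with a simple model, which is an assumption of this sketch rather than a formula given in the text. If the aggregation window is `window` rows high and k additional rows are read at once, a strip of window + k rows produces k + 1 rows of aggregated cost, so each initial cost is read about (window + k) / (k + 1) times on average.

```python
def average_access_count(window, k):
    """Estimated reads per initial cost with k vertically expanded rows.

    Assumed model: a strip of (window + k) rows is read once and yields
    (k + 1) rows of aggregated cost; image borders are ignored.
    """
    if window < 1 or k < 0:
        raise ValueError("window must be >= 1 and k must be >= 0")
    return (window + k) / (k + 1)


# Estimates for a 25-row aggregation window and k = 0, 1, ..., 24.
counts = [average_access_count(25, k) for k in range(25)]
```

Under this model the count falls from 25 reads per cost without expansion to fewer than 2 at k = 24, while the strip that must be held locally grows from 25 to 49 rows. This illustrates the trade-off between bandwidth and local memory size.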

Fig. 5-5 plots the relationship between the average access count of an initial matching cost and the value k for an aggregation window size of 25x25. The value k represents the number of expanded rows. It can be observed that the average access count decreases as k increases. This suggests that the more rows are expanded, the less bandwidth is needed. However, increasing the value of k also increases the local memory size and the computing resource requirement.

Fig. 5-5 The average access count versus the number of expanded pixel (vertical axis: Average Access Count, 0 to 30; horizontal axis: Expanded Pixels, 0 to 25)

5.5. Comparison
TABLE I compares the estimated memory size and bandwidth requirement of the disparity-order and pixel-order reuse methods. The target disparity image is 352x288 pixels with 64 disparity levels. The real-time constraint is 30 fps. The architecture is assumed to operate at a 100MHz clock with a 32-bit data port to the external memory. The sizes of the support windows in the matching cost computation and the cost aggregation are 9x9 and 25x25 pixels respectively.

5.6. Summary
This chapter explores the impact of disparity-order and pixel-order data reuse in the matching cost computation and proposed the partial column reuse (PCR) and