
3.1.1 Baseline Belief Propagation

The concept of the BP-based algorithm is illustrated in Figure III-1. In the BP-based algorithms, an energy function is generally formulated as

$$ E(\mathbf{d}) = \sum_{i \in I} D(d_i) + \sum_{i \in I,\; j \in \mathrm{Neighbor}(i)} V(d_i, d_j) \tag{III-1} $$

for the 2-D graph in Figure III-1 (a). In this energy function, D is the data cost of each node, where each node corresponds to a pixel; V is the smoothness cost of each edge; and d is the set of disparities selected for all nodes. Together, the two costs constrain the selection of the disparity set d. The data cost D enforces that the correspondences are similar, and the smoothness cost V enforces that the disparities of neighboring nodes are consistent. To minimize the energy function and obtain an appropriate disparity set d, the BP-based algorithms perform an iterative process called message passing. A shortcoming of the BP-based algorithms is that the energy function is not guaranteed to converge. Nevertheless, the disparity map can approach a steady state after sufficient iterations.
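As an illustration of (III-1), the following minimal sketch evaluates the energy of a given disparity assignment on a 4-connected grid. The function names, the nested-list data layout, and the truncated linear smoothness model are assumptions made for this example only; the text above does not fix a particular form of D or V.

```python
def energy(disp, data_cost, smooth_cost):
    """Evaluate E(d) = sum_i D(d_i) + sum_{i, j in Neighbor(i)} V(d_i, d_j).

    disp[y][x]          : selected disparity of the node at (y, x)
    data_cost[y][x][d]  : data cost D of assigning disparity d to that node
    smooth_cost(a, b)   : smoothness cost V between two neighboring disparities
    """
    height, width = len(disp), len(disp[0])
    total = 0
    for y in range(height):
        for x in range(width):
            d = disp[y][x]
            total += data_cost[y][x][d]
            # Visit all four neighbors, so each edge is counted in both
            # directions, as the double sum in (III-1) is written.
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    total += smooth_cost(d, disp[ny][nx])
    return total


def truncated_linear(a, b, weight=1, limit=2):
    # Hypothetical smoothness model (assumption): min(weight * |a - b|, limit).
    return min(weight * abs(a - b), limit)


# Two nodes, two disparity levels: node 0 prefers d = 0, node 1 prefers d = 1.
costs = [[[0, 5], [4, 0]]]
print(energy([[0, 1]], costs, truncated_linear))  # data 0 + smoothness 2
print(energy([[0, 0]], costs, truncated_linear))  # data 4 + smoothness 0
```

In this toy case the assignment that follows the data cost has the lower energy, even though it pays a smoothness penalty on the single edge.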

Figure III-1 Illustrations of BP: (a) node plane; (b) message passing; (c) belief calculation.

To meet the requirement of real-time processing, a direct hardware implementation of BP-based algorithms faces two design challenges: the high computational complexity and the large storage of message passing. For example, at 640x480@30fps with a disparity range of 32, message passing requires about 1,200 billion operations per second, and the messages occupy about 157 Mbytes.
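The storage figure can be reproduced with simple arithmetic. The sketch below assumes four messages per node (one per 4-connected neighbor) and 4 bytes per message entry; neither value is stated in the text, but together they give the quoted size. The operation rate also depends on the number of iterations and on the operations spent per label pair, which are not given here, so the sketch stops at the number of label pairs per iteration.

```python
WIDTH, HEIGHT, FPS, LABELS = 640, 480, 30, 32  # figures from the text
DIRECTIONS = 4        # assumption: one message per 4-connected neighbor
BYTES_PER_VALUE = 4   # assumption: 32-bit message entries

nodes = WIDTH * HEIGHT
message_values = nodes * DIRECTIONS * LABELS
message_bytes = message_values * BYTES_PER_VALUE

# Direct O(L^2) update: every output label scans every input label.
label_pairs_per_iteration = nodes * DIRECTIONS * LABELS * LABELS
label_pairs_per_second_per_iteration = label_pairs_per_iteration * FPS

print(message_bytes)               # 157286400 bytes, about 157 Mbytes
print(label_pairs_per_iteration)   # 1258291200
```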

To address the above problems, various approaches have been proposed. Felzenszwalb and Huttenlocher [25] proposed an efficient message passing method that reduces the computational complexity from $O(L^2)$ to $O(L)$, where L is the number of disparity levels, and a bipartite message approach that reduces the memory cost by 50%. Following their approach, Yang et al. [27] implemented it on a high-performance GPU, and Park et al. [28] designed an array processing architecture on two FPGA boards that achieves 320x240@30fps but requires 880 KB of on-chip memory. Cheng et al. [29]-[33] proposed a tile-based BP and a fully parallel architecture for each message passing processing element (PE) to reach real-time processing for an image size of 640x480. Nevertheless, all of these implementations still suffer from high memory cost.
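To make the $O(L^2)$ to $O(L)$ reduction concrete, the sketch below compares a direct min-sum message update with a two-pass linear-time update. It assumes a truncated linear smoothness cost, min(weight * |p - q|, limit), which is the common case for this technique but is not specified in the text above. Here h stands for the data cost plus the incoming messages from the other neighbors, and all function and parameter names are hypothetical.

```python
def message_brute_force(h, weight, limit):
    """O(L^2) update: m[q] = min over p of h[p] + min(weight * |p - q|, limit)."""
    labels = range(len(h))
    return [min(h[p] + min(weight * abs(p - q), limit) for p in labels)
            for q in labels]


def message_linear_time(h, weight, limit):
    """O(L) update for the same truncated linear cost."""
    m = list(h)
    for q in range(1, len(m)):            # forward pass
        m[q] = min(m[q], m[q - 1] + weight)
    for q in range(len(m) - 2, -1, -1):   # backward pass
        m[q] = min(m[q], m[q + 1] + weight)
    cap = min(h) + limit                  # truncation term
    return [min(value, cap) for value in m]


h = [3, 0, 5, 9]
print(message_brute_force(h, 1, 2))   # [1, 0, 1, 2]
print(message_linear_time(h, 1, 2))   # [1, 0, 1, 2]
```

Both functions return the same message; the second one touches each label a constant number of times instead of scanning all label pairs.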

In summary, although previous work used parallel PEs to cope with the high complexity, the resulting logic still occupies too much area because each PE has a high area cost. In addition, none of the previous work adequately reduced the memory cost, owing to their fixed memory access approaches.

To solve these problems, we propose a hardware-efficient architecture for various BP-based algorithms through three techniques in Section 3.2. For the high memory cost, we propose a spinning-message approach, which rearranges the message configuration in an internal memory to save 50% of the memory cost. In addition, we propose a sliding-bipartite node plane, which combines the advantages of previous work to reduce the memory cost further. For the message passing PE, we propose a buffer-free PE architecture, which removes all the large buffers and shares common operators to reduce the logic cost without significant speed degradation. The proposed low memory access approaches and the buffer-free PE architecture can be applied together to various BP-based algorithms to significantly reduce their hardware cost and to reach real-time processing without changing their disparity accuracy.

3.1.2 Joint Bilateral Upsampling

The JBU algorithm [79] was proposed to upscale the results of various image processing operations, such as tone mapping, colorization, photomontage, and disparity maps. The main idea of JBU is to use a high-resolution image to guide the upsampling process. To upsample a disparity map, given the high-resolution image $I_H$ and the low-resolution disparity map $D_L$, the high-resolution disparity map $D_H$ can be computed by

$$ D_H(c) = \frac{1}{\kappa} \sum_{q_L \in S} D_L(q_L) \cdot f(\lVert c_L - q_L \rVert) \cdot g(\lVert I_H(c) - I_H(q) \rVert), \tag{III-2} $$

where f is the spatial kernel, whose argument is the spatial distance $\lVert c_L - q_L \rVert$ in the low-resolution frame, and g is the range kernel, whose argument is the color distance $\lVert I_H(c) - I_H(q) \rVert$ in the high-resolution frame. Note that the positions c and q are in the high-resolution frame, and $c_L$ and $q_L$ are their corresponding positions in the low-resolution frame. Both kernels are Gaussian weight functions. In addition, κ is the sum of the weights, used for normalization, and S is the window of the spatial kernel.
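A minimal sketch of (III-2) for a grayscale guide image follows. It assumes an integer scale factor, a high-resolution frame exactly `scale` times the size of the low-resolution one, and q taken as the high-resolution position `q_L * scale`. These choices, the window handling at the borders, and all names and default parameters are illustrative assumptions rather than the formulation of [79].

```python
import math


def gaussian(distance, sigma):
    return math.exp(-(distance * distance) / (2.0 * sigma * sigma))


def jbu(disp_low, image_high, scale, radius=1, sigma_s=1.0, sigma_r=10.0):
    """Upsample disp_low (D_L) to the size of image_high (I_H) as in (III-2)."""
    low_h, low_w = len(disp_low), len(disp_low[0])
    high_h, high_w = len(image_high), len(image_high[0])
    out = [[0.0] * high_w for _ in range(high_h)]
    for cy in range(high_h):
        for cx in range(high_w):
            cly, clx = cy / scale, cx / scale   # c_L, fractional position
            by, bx = int(cly), int(clx)         # low-res sample at or before c_L
            weight_sum = 0.0
            value_sum = 0.0
            for qly in range(by - radius, by + radius + 1):
                for qlx in range(bx - radius, bx + radius + 1):
                    if not (0 <= qly < low_h and 0 <= qlx < low_w):
                        continue
                    # q is the high-resolution position of q_L.
                    qy = min(qly * scale, high_h - 1)
                    qx = min(qlx * scale, high_w - 1)
                    f = gaussian(math.hypot(cly - qly, clx - qlx), sigma_s)
                    g = gaussian(image_high[cy][cx] - image_high[qy][qx], sigma_r)
                    weight_sum += f * g
                    value_sum += disp_low[qly][qlx] * f * g
            out[cy][cx] = value_sum / weight_sum   # division by kappa
    return out


# A vertical intensity edge in the guide keeps the two disparities apart.
guide = [[0, 0, 200, 200] for _ in range(4)]
low = [[10, 50], [10, 50]]
upsampled = jbu(low, guide, scale=2)
print(round(upsampled[0][1]), round(upsampled[0][3]))
```

With this guide image, pixels on the dark side take their disparity almost entirely from the left low-resolution samples and pixels on the bright side from the right ones, which is the edge-preserving behavior the range kernel g is meant to provide.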

Based on the original JBU algorithm, various modified JBU algorithms have been proposed with different equations. Chan et al. [80] proposed the noise-aware filter for depth upsampling (NAFDU), which adds a range kernel h on the low-resolution image to reduce the texture copy artifact. NAFDU is defined as

$$ D_H(c) = \frac{1}{\kappa} \sum_{q_L \in S} D_L(q_L)\, f(\lVert c_L - q_L \rVert) \left[ \alpha\, g(\lVert I_H(c) - I_H(q) \rVert) + (1 - \alpha)\, h(\lVert I_L(c_L) - I_L(q_L) \rVert) \right], \tag{III-3} $$

where α is a blending value related to the disparity variance. With the additional kernel h, the JBU algorithm can resist the texture copy artifact because h uses the color distance in the sampled (low-resolution) frame. The effect of h is increased in regions with low disparity variance, whereas the effect of g is increased in regions with high disparity variance.
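The blended weight of a single sample in (III-3) can be sketched as below. How α is derived from the disparity variance is not given in this section, so α is simply passed in. The function name, the Gaussian form of h, and the default sigmas are assumptions for illustration.

```python
import math


def gaussian(distance, sigma):
    return math.exp(-(distance * distance) / (2.0 * sigma * sigma))


def nafdu_weight(spatial_dist, high_color_dist, low_color_dist, alpha,
                 sigma_s=1.0, sigma_g=10.0, sigma_h=10.0):
    """Weight of one sample q_L in (III-3): f * (alpha * g + (1 - alpha) * h)."""
    f = gaussian(spatial_dist, sigma_s)
    g = gaussian(high_color_dist, sigma_g)   # range kernel on I_H
    h = gaussian(low_color_dist, sigma_h)    # range kernel on I_L
    return f * (alpha * g + (1.0 - alpha) * h)


# alpha = 1 reduces to the JBU weight f * g; alpha = 0 uses only h.
print(nafdu_weight(1.0, 5.0, 20.0, alpha=1.0))
print(nafdu_weight(1.0, 5.0, 20.0, alpha=0.0))
```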

On the other hand, Riemens et al. [81] proposed the multi-step JBU algorithm, which doubles the resolution of the disparity map in each iteration. This approach reduces the computational complexity by decreasing the window size of the spatial kernel. In addition, equation (III-2) is changed to

$$ D_H(c) = \frac{1}{\kappa} \sum_{q_L \in S} D_L(q_L) \cdot f(\lVert c_L - q_L \rVert) \cdot g(\lVert I_L(c_L) - I_H(q) \rVert), \tag{III-4} $$

where the high-resolution pixel $I_H(q)$ of (III-2) is replaced with the low-resolution pixel $I_L(c_L)$. The fast multi-step JBU algorithm was implemented on a programmable DSP platform to achieve a throughput of 720x576@50fps [82]. However, this is far from our target throughput because of the limited resources of the DSP platform.

The computational characteristics of the above JBU algorithms are the same as those of joint bilateral filtering (JBF), which is an extended version of the bilateral filter (BF). The BF and the JBF are respectively defined as

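$$ I'(c) = \frac{1}{\kappa} \sum_{q \in S} I(q)\, f(\lVert c - q \rVert)\, g(\lVert I(c) - I(q) \rVert) \tag{III-5} $$

and

$$ J'(c) = \frac{1}{\kappa} \sum_{q \in S} J(q)\, f(\lVert c - q \rVert)\, g(\lVert I(c) - I(q) \rVert). \tag{III-6} $$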

Therefore, the existing acceleration approaches for JBF and BF could be applied to the JBU algorithm.
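The relation between (III-5) and (III-6) can be seen in a brute-force 1-D sketch: the JBF filters a target signal J with range weights taken from a guide signal I, and the BF is the special case in which the guide is the signal itself. The 1-D setting, the function names, and the default parameters are simplifying assumptions; this direct form is not one of the accelerated approaches.

```python
import math


def gaussian(distance, sigma):
    return math.exp(-(distance * distance) / (2.0 * sigma * sigma))


def joint_bilateral_1d(target, guide, radius=2, sigma_s=1.0, sigma_r=10.0):
    """Filter target (J) with range weights from guide (I), as in (III-6)."""
    n = len(target)
    out = []
    for c in range(n):
        weight_sum = 0.0
        value_sum = 0.0
        for q in range(max(0, c - radius), min(n, c + radius + 1)):
            w = gaussian(c - q, sigma_s) * gaussian(guide[c] - guide[q], sigma_r)
            weight_sum += w
            value_sum += target[q] * w
        out.append(value_sum / weight_sum)   # division by kappa
    return out


def bilateral_1d(signal, radius=2, sigma_s=1.0, sigma_r=10.0):
    """(III-5): the special case of (III-6) where the guide is the signal."""
    return joint_bilateral_1d(signal, signal, radius, sigma_s, sigma_r)


step = [0, 0, 0, 100, 100, 100]
print([round(v) for v in bilateral_1d(step)])   # the step edge is preserved
```

Because both filters share the same loop structure and differ only in which signal supplies the values, an acceleration scheme for one carries over to the other.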

The state-of-the-art approaches proposed by Yang et al. [83] and Porikli [91] achieve constant time complexity, but they suffer from extremely high memory cost. This dissertation focuses on Porikli's approach for JBF because we can exploit its computational characteristic, a single iterative raster scan, to reduce its memory cost.

The following two sections analyze the computation of the baseline BP algorithm and the JBF algorithm, and propose architecture designs for the key components to address their design challenges.