3.3 Analysis and Design of Joint Bilateral Filtering
3.3.4 Proposed Architecture
With the above memory reduction methods, the computational flow of JBF in Table III-5 changes to that in Table III-6. The design techniques are detailed below.
Table III-6 Modified computational flow and analysis for a pixel in the integral histogram approach
Process                                            Complexity
Pixel count histogram hcc                          Loop b=0 to Nb-1
  IHcOS(b) = IHcOD(b) + IHcO’S’(b) - IHcO’D’(b)
  IHcOS(IS) += 1, IHcOS(IQ) -= 1
Pixel intensity histogram hic                      Loop b=0 to Nb-1
Pixel count histogram hcc                          Loop b=0 to Nb-1
  hcc(b) = IHcOS(b) - IHcOR(b)
Pixel intensity histogram hic                      Loop b=0 to Nb-1
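The integration and extraction recurrences in Table III-6 can be sketched in software as follows. This is a minimal sketch, not the hardware design. The function names, the list-based storage of every IH, the reading of Q as the pixel |S| rows above S in the same column, and the placement of R at |S| columns to the left of S are all assumptions made for illustration. Passing 1 for every pixel yields the count histogram hcc, and passing the source pixel J yields hic.

```python
def integrate(bins, values, num_bins, win):
    """Build a vertically windowed integral histogram for every pixel.

    bins[y][x]   : histogram bin of the guidance pixel at (x, y)
    values[y][x] : amount added to that bin (1 for hcc, source pixel J for hic)
    win          : filter window width |S| (assumed meaning of the Q update)
    ih[y][x][b] then sums `values` over columns 0..x and rows y-win+1..y.
    """
    h, w = len(bins), len(bins[0])
    zero = [0] * num_bins
    ih = [[zero] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = ih[y][x - 1] if x > 0 else zero                    # IH_OD
            s_up = ih[y - 1][x] if y > 0 else zero                 # IH_O'S'
            d_up = ih[y - 1][x - 1] if x > 0 and y > 0 else zero   # IH_O'D'
            cur = [d[b] + s_up[b] - d_up[b] for b in range(num_bins)]
            cur[bins[y][x]] += values[y][x]                        # pixel S enters
            if y - win >= 0:                                       # pixel Q leaves (assumption)
                cur[bins[y - win][x]] -= values[y - win][x]
            ih[y][x] = cur
    return ih


def extract(ih, x, y, win):
    """Histogram of the win-wide window whose bottom-right pixel is (x, y).

    Only two IHs are needed: IH_OS at (x, y) and IH_OR at (x - win, y).
    """
    s = ih[y][x]
    r = ih[y][x - win] if x - win >= 0 else [0] * len(s)
    return [a - b for a, b in zip(s, r)]
```

Under these assumptions, the extra subtraction at Q keeps each IH limited to the last |S| rows, which is why the extraction step needs only the difference of two IHs instead of four.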
1. Overall Architecture
Figure III-21 shows the overall architecture, which consists of two parts: the interface and the core. In this architecture, the image pixels are stored in off-chip memory and the IHs are stored in on-chip memory.
The interface accesses pixels from the off-chip memory through a 64-bit bus, and the core performs the computation of JBF.
In the interface, the access controller allocates bus priority to the input and output first-in-first-out (FIFO) buffers using a round-robin policy. The size of each buffer is tied to the off-chip bandwidth: larger buffers can support data reuse schemes that reduce the off-chip bandwidth.
Because this architecture has sufficient bandwidth, we do not apply any data reuse scheme here, and set each buffer size to 16 pixels to match the bus width and support a ping-pong mechanism for simultaneous reading and writing.
The operation of the architecture is described below with the schedule in Figure III-22, which is hierarchically sliced from a frame down to pipeline tiles. Computing one stripe row requires 90 cycles (ws + |S| - 1) for a stripe width ws of 60 and a filter window width |S| of 31. Note that the architecture actually spends 96 cycles on each stripe row; the last 6 cycles are bubble cycles that simplify the control logic. For the process in a pipeline tile, the access controller in the interface first fetches pixels from the off-chip memory into the FIFO buffers. The two histogram calculation engines in the core then compute hic and hcc, and the convolution engine consecutively produces 8 pixels into the output FIFO buffer. Finally, the interface moves the results from the buffer to the off-chip memory.
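The cycle counts quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that a stripe row needs |S| - 1 border columns in addition to the ws stripe columns, and that the 96-cycle figure is the 90-cycle requirement rounded up to a multiple of 8 pixels. The second assumption is only one plausible reading of the bubble cycles; the text does not state the rounding rule. The function names are hypothetical.

```python
def stripe_row_cycles(stripe_width, window_width):
    """Cycles to stream one stripe row at one pixel per cycle.

    The stripe columns plus (window_width - 1) border columns are needed
    so that every output window in the stripe is complete.
    """
    return stripe_width + window_width - 1


def round_up(cycles, granule):
    """Round a cycle count up to a multiple of `granule` (assumed rule)."""
    return -(-cycles // granule) * granule


needed = stripe_row_cycles(60, 31)   # ws = 60, |S| = 31, from the text
scheduled = round_up(needed, 8)      # 8-pixel granule is an assumption
bubbles = scheduled - needed
print(needed, scheduled, bubbles)
```

With the values from the text this gives 90 required cycles, 96 scheduled cycles, and 6 bubble cycles.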
Figure III-21 Proposed architecture of JBF.
Figure III-22 Schedule of the proposed architecture
2. Architecture Components
In the core, the main components are two histogram calculation engines and one convolution engine, which carry out the computation in Table III-6. As mentioned above, this computation has high complexity. The proposed R-parallelism method therefore unrolls all computational loops over the range domain R.
The details of this method are described in each engine as follows.
(1) Histogram Calculation Engine
The histogram calculation engines perform the integration and extraction processes for hcc and hic as shown in Table III-6. With the R-parallelism method, we design their architectures as shown in Figure III-24, where the selected-bin adder (SBA) is depicted in Figure III-23. Each engine achieves a throughput of 1 histogram/cycle. The two engines differ in the value that the SBAs integrate: the engine hic integrates the source pixel J, whereas the engine hcc integrates the constant 1. In addition, all data bit widths in the engine hic are 8 bits wider than those in hcc.
Figure III-23 Selected-bin adder in the histogram calculation engines
Figure III-24 Proposed architectures of histogram calculation engines hic and hcc
In the above architectures, each engine needs to access five IHs: IHOˊSˊ, IHOˊDˊ, IHOD, IHOS, and IHOR. This on-chip bandwidth can be reduced by the delay-buffer method, which is derived below from the data dependency of the associated IHs in two successive cycles. Assume that the pixels S, Sˊ, D, and Dˊ shown in Figure III-20 (d) are located at (x,y), (x,y-1), (x-1,y), and (x-1,y-1) in cycle t, respectively. Hence, their IHs can be denoted by
$$S(t): IH_O(x, y), \quad S'(t): IH_O(x, y-1), \quad D(t): IH_O(x-1, y), \quad D'(t): IH_O(x-1, y-1). \tag{III-21}$$

For the next cycle t+1, their x-coordinates are increased by 1 as follows,

$$S(t+1): IH_O(x+1, y), \quad S'(t+1): IH_O(x+1, y-1), \quad D(t+1): IH_O(x, y), \quad D'(t+1): IH_O(x, y-1). \tag{III-22}$$

From (III-21) and (III-22), we can find that D(t+1) equals S(t), and Dˊ(t+1) equals Sˊ(t). That means IHOˊDˊ and IHOD can be obtained by delaying IHOˊSˊ and IHOS for one cycle, respectively. Therefore, we can use two delay-buffers to avoid accessing IHOˊDˊ and IHOD from the on-chip memory, and reduce the bandwidth from five IHs to three IHs.
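The delay-buffer idea can be illustrated with a small software model of one row of integration. This is a sketch under assumptions, not the circuit itself: the function name, the `increments` argument (standing in for the per-pixel bin updates), and the read counter are hypothetical. The two local variables play the role of the two delay-buffers, so the only IH read from memory in each cycle of this model is IHOˊSˊ.

```python
def integrate_row_with_delay(prev_row, increments, num_bins):
    """Integrate one row of IHs using two one-cycle delay registers.

    prev_row[x]   : IH_O'S' for column x, read from (modelled) on-chip memory
    increments[x] : per-bin update contributed by the pixel at column x
    Returns the IH_OS values of the row and the number of IH memory reads.
    """
    reg_d = [0] * num_bins      # delay-buffer holding IH_OD  (= last IH_OS)
    reg_d_up = [0] * num_bins   # delay-buffer holding IH_O'D' (= last IH_O'S')
    row, reads = [], 0
    for x in range(len(prev_row)):
        s_up = prev_row[x]      # the single IH read of this cycle
        reads += 1
        cur = [reg_d[b] + s_up[b] - reg_d_up[b] + increments[x][b]
               for b in range(num_bins)]
        row.append(cur)
        # D(t+1) = S(t) and D'(t+1) = S'(t), as in (III-21) and (III-22)
        reg_d, reg_d_up = cur, s_up
    return row, reads
```

In this model the integration step reads one IH per pixel instead of three. Together with the IHOR read for extraction and the IHOS write, that matches the three IH accesses mentioned in the text.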
(2) Convolution Engine
The convolution engine uses the histograms hcc and hic to compute the result pixel through the kernel calculation and convolution processes in Table III-6. Its architecture is shown in Figure III-25 (a). With the proposed R-parallelism method, the convolution process achieves a throughput of 1 pixel/cycle. Registers can also be added at the available cut-lines in the figure for pipelining, which allows a higher operating frequency and hence higher throughput.
The R-parallelism method brings high throughput but requires range tables that are both large and numerous. To address the size, we take advantage of the symmetry and truncation properties of the Gaussian function to decrease the table size from 256 to 32. To address the number, we share a single table through the table selection module shown in Figure III-25 (b), which reduces the number of tables to one. Note that the output of the divider falls directly within the 8-bit range because the division normalizes the weighted sum of pixels in (III-10).
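The table reduction and the histogram-based convolution can be sketched together as follows. This is only an illustration under stated assumptions: the range sigma value, the choice to index the table by absolute intensity difference, the `bin_values` representative intensities, and the function names are hypothetical. The output is assumed to be the range-weighted sum of hic divided by the range-weighted sum of hcc, which is one common form of histogram-based bilateral filtering and may differ in detail from (III-10).

```python
import math

TABLE_SIZE = 32   # reduced table size quoted in the text
SIGMA_R = 8.0     # hypothetical range sigma, for illustration only

# Only non-negative differences below TABLE_SIZE are stored.
RANGE_TABLE = [math.exp(-(d * d) / (2.0 * SIGMA_R ** 2))
               for d in range(TABLE_SIZE)]


def range_weight(diff):
    """Gaussian range weight via symmetry and truncation."""
    d = abs(diff)                   # symmetry: G(-d) == G(d)
    if d >= TABLE_SIZE:             # truncation: the tail is treated as zero
        return 0.0
    return RANGE_TABLE[d]


def convolve(center, bin_values, hcc, hic):
    """Combine the two histograms into one output pixel (assumed form).

    center     : guidance intensity of the pixel being filtered
    bin_values : representative intensity of each bin (hypothetical)
    hcc, hic   : pixel count and pixel intensity histograms of the window
    """
    num = den = 0.0
    for value, count, total in zip(bin_values, hcc, hic):
        w = range_weight(center - value)
        num += w * total            # weighted sum of source pixels
        den += w * count            # weighted pixel count (normalizer)
    return num / den if den else 0.0
```

Because the quotient is a weighted average of 8-bit source pixels, it stays within the 8-bit range, which is consistent with the remark about the divider output.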
Figure III-25 Proposed architecture of (a) convolution engine and (b) its table selection modules
Furthermore, the histogram calculation engines and the convolution engine can be serially connected to achieve the throughput of 1 pixel/cycle. More engines can be used to process multiple cascaded pixels simultaneously for higher throughput. The proposed memory reduction methods could be directly extended to support the processing of multiple pixels.
3.3.5 Implementation Result
Referring to the quality analysis in [91], we select 31 for |S| and 64 for Nb in our implementation.
The proposed architecture of JBF has been implemented in Verilog and synthesized in a 90-nm CMOS process. Table III-7 lists the implementation result of the proposed architecture.
The hardware design achieves a throughput of HD1080p at 60 frames/s, that is, 124 Mpixels/s, at a cost of 23K bytes of memory and 356K gates.
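The quoted pixel rate follows directly from the frame size and frame rate, and the same numbers allow a rough estimate of the clock needed by the stripe schedule. The sketch below is a back-of-envelope check, not a figure reported in the text: it assumes each stripe processes exactly one stripe row per image row (vertical border rows are ignored) and reuses the stripe width of 60 and the 96 cycles per stripe row given earlier. The function names are hypothetical.

```python
def pixel_rate(width, height, fps):
    """Output pixels per second for a given video format."""
    return width * height * fps


def estimated_clock_hz(width, height, fps, stripe_width, cycles_per_stripe_row):
    """Rough clock estimate for the stripe schedule (assumed model)."""
    stripes = -(-width // stripe_width)          # ceiling division
    cycles_per_frame = stripes * height * cycles_per_stripe_row
    return cycles_per_frame * fps


rate = pixel_rate(1920, 1080, 60)
clock = estimated_clock_hz(1920, 1080, 60, 60, 96)
print(f"{rate / 1e6:.1f} Mpixels/s, about {clock / 1e6:.1f} MHz")
```

Under these assumptions the pixel rate is about 124.4 Mpixels/s and the estimated clock is about 199.1 MHz.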
Table III-7 Example implementation result of the proposed architecture
Technology Process           UMC 90nm
Image Size MxN               1920x1080
Number of Bin Nb             64
Filter Window Size |S|^2     31x31
Stripe Width ws              60
Histogram Cal.               97,766     148,649
Convolution                  168,333    197,351
Total                        276,178    355,917
On-chip Memory (Byte)        23K        23K
Table III-8 compares the hardware costs of the proposed methods and the original integral histogram at different resolutions. With the proposed memory reduction and architecture design techniques, the complexity is reduced to 0.15% and the memory requirement is reduced to 0.003%-0.02%. In addition, the bandwidth for IH (i.e., on-chip bandwidth) is reduced to 32%-36%, whereas the bandwidth for pixels (i.e., off-chip bandwidth) increases to 20.3-132.7 Mbits. Nevertheless, this off-chip bandwidth is affordable for the 64-bit bus operating at 200 MHz.
Table III-8 Comparison of hardware cost per frame
Table III-9 compares our proposed hardware design with the previous VLSI implementations.
The previous implementations [94], [97] support large filtering windows but only at low throughput, while the implementations [95], [96] reach high throughput only for small filtering windows. Our design both achieves high throughput and supports a large filtering window. Table III-10 compares our design with previous GPU and CPU implementations. Compared with the other designs, the proposed architecture uses its hardware cost efficiently to achieve high throughput.
Table III-9 Previous VLSI implementations of bilateral filtering
                         [94]       [95]          [96]     [97]       Our Design
Supported Window Size    15x15 BF   3x3 BF-like   5x5 BF   11x11 BF   31x31 BF/JBF
Implementation Method    Xilinx Spartan-3
Table III-10 Comparison of different implementations
Support-Pixel-First Target-Pixel-First
Subsampling Bilateral Grid Piecewise-linear Gaussian KD-tree