
6.   Architecture Design and Implementation

6.2. Overall Architecture

Fig. 6.1. Proposed architecture of JBF

Fig. 6.1 shows the overall architecture, which contains two parts: the interface and the core. In this architecture, the image pixels and the IHs are stored in the off-chip and on-chip memories, respectively. The interface accesses pixels from the off-chip memory through a 64-bit bus, and the core performs the computation of the JBF.

In the interface, the access controller allocates the bus priority to the input and output first-in-first-out (FIFO) buffers using a round-robin policy. The size of each buffer is tied to the off-chip bandwidth: large buffers can support data reuse schemes that reduce the off-chip bandwidth. Because the off-chip bandwidth in this architecture is sufficient, we do not apply any data reuse scheme, which keeps the buffer cost low, and we set each buffer size to 2x8 pixels, where 8 matches the bus width and 2 supports the ping-pong mechanism for simultaneous reading and writing.


6.3. Interface

Fig. 6.2. Mechanism of input and output data control

In the interface, the round-robin finite state machine (FSM) has six states. States 0 to 4 are associated with the input FIFO buffers; the state value determines which FIFO buffer takes the incoming 8-pixel data. For example, as shown in Fig. 6.2, the FIFO buffer of Ic takes input when the state is 0; at other times, it keeps its stored data. State 5 is associated with the output FIFO buffer: the 8-pixel packaged result in the FIFO buffer of Oc is sent to the bus when the state is 5; at other times, this FIFO is loaded with newly processed results from the core.
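As a behavioral sketch of this control flow (the buffer and signal names are illustrative and not those of the actual Verilog design), the round-robin FSM can be modeled as follows:

```python
# Behavioral sketch of the round-robin FSM in the interface (illustrative names only).
NUM_STATES = 6  # states 0-4 serve the input FIFO buffers, state 5 serves the output FIFO

def round_robin_step(state, bus_word, input_fifos, output_fifo, bus_out):
    """Route one 8-pixel word between the 64-bit bus and the FIFO buffers."""
    if state < 5:
        # The selected input FIFO takes the 8-pixel word; the others keep their data.
        input_fifos[state].append(bus_word)
    elif output_fifo:
        # The 8-pixel packaged result is sent from the output FIFO to the bus.
        bus_out.append(output_fifo.pop(0))
    return (state + 1) % NUM_STATES
```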

The FIFO buffer of each input has a 2x8-pixel ping-pong structure. At any time, one of the two 8-pixel buffers is in Update mode and the other is in Give mode. This structure simplifies the scheduling because it lets the buffer receive data (through the Update-mode buffer) and give out data (from the Give-mode buffer) in the same cycle. In our schedule, the Update-mode buffer is loaded with an 8-pixel input in one cycle; for example, Fig. 6.3 (a) shows an input arriving, and in Fig. 6.3 (b) the Update-mode buffer is loaded with the data. In the same cycle, the Give-mode buffer gives out one pixel to the core. The modes are exchanged after the Update-mode buffer has been loaded and the Give-mode buffer has given out all of its data, as shown in Fig. 6.3 (c).

After the switch, the newly loaded data starts to pour out and the empty buffer waits to be loaded again, as in Fig. 6.3 (d). During the process, the modes exchange continuously.

Fig. 6.3. Process of the ping-pong structure: (a) an input arrives; (b) in the next cycle, the Update-mode buffer is loaded by the input and the Give-mode buffer gives out a pixel; (c) ready for the mode exchange; (d) after the mode exchange.
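A minimal behavioral model of one such 2x8-pixel ping-pong FIFO is sketched below (the class and method names are illustrative only); the schedule guarantees that a new 8-pixel word arrives before the Give-mode buffer runs out of data.

```python
class PingPongFifo:
    """Behavioral sketch of a 2x8-pixel ping-pong input FIFO (illustrative only)."""

    def __init__(self):
        self.update = []            # Update-mode buffer: filled by the interface
        self.give = [0] * 8         # Give-mode buffer: feeds one pixel per cycle to the core
        self.give_pos = 0

    def cycle(self, incoming_word=None):
        """One cycle: optionally load an 8-pixel word and always give one pixel out."""
        if incoming_word is not None:
            self.update = list(incoming_word)   # load the Update-mode buffer
        pixel = self.give[self.give_pos]        # the Give-mode buffer gives out one pixel
        self.give_pos += 1
        if self.give_pos == 8:                  # Give buffer emptied: exchange the modes
            self.give, self.update = self.update, []
            self.give_pos = 0
        return pixel
```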


6.4. Time Schedule

Fig. 6.4. Schedule of the proposed architecture

The operations of the architecture are described below with the schedule in Fig. 6.4, which is hierarchically sliced from a frame down to pipeline tiles. The throughput of each pipeline tile is the computed result of 8 pixels. In a pipeline tile, the access controller in the interface first reads pixels from the off-chip memory and stores them into the FIFO buffers; this takes 5 cycles to step through the five input states (states 0 to 4) of the round-robin FSM. Then the two histogram calculation engines in the core compute h'c and hc, and the convolution engine consecutively produces 8 pixel results, which are sent to the output FIFO buffer. Finally, the interface moves the 8-pixel packaged results from the buffer to the off-chip memory in state 5 of the FSM.

Following the quality analysis in [34], this schedule uses a window width of 31 pixels and a stripe width of 60 pixels. Therefore, an HD1080p image is sliced into 32 stripes, and the width of an integral region is 90 pixels. Each row of an integral region requires 12 pipeline tiles, since each tile calculates an 8-pixel-wide histogram. With the fully pipelined schedule, performing 12 pipeline tiles takes 96 cycles.

Summing over the 32 stripes, an HD1080p frame needs 3,317,760 cycles.
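The cycle count above can be reproduced directly from these numbers (a sketch of the arithmetic only):

```python
import math

# Reproducing the cycle count for HD1080p with ws = 60 and |S| = 31.
M, N = 1080, 1920                             # frame height and width
ws, S = 60, 31                                # stripe width and window width
stripes = N // ws                             # 1920 / 60 = 32 stripes
region_width = ws + S - 1                     # 60 + 31 - 1 = 90-pixel-wide integral region
tiles_per_row = math.ceil(region_width / 8)   # 12 pipeline tiles per region row
cycles_per_row = tiles_per_row * 8            # 96 cycles per region row
print(cycles_per_row * M * stripes)           # 96 * 1080 * 32 = 3,317,760 cycles per frame
```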

6.5. Design Components

In the core, the main components are the two histogram calculation engines and the convolution engine for the computations in TABLE. 6-1, which have high computational complexity as mentioned above. Thus, the proposed R-parallelism method unrolls all computational loops in the range domain R. The details of this method are described for each engine as follows.

6.5.1. Histogram Calculation Engine

The histogram calculation engines perform the integration and extraction processes for hc and h’c as shown in TABLE. 6-1. With the R-parallelism method, we design their architectures as shown in Fig. 6.6, where the selected-bin adder (SBA) is depicted in Fig. 6.5. These two engines can achieve the throughput of 1 histogram per cycle.

Note that the only difference between the two engines is that the integral value of the SBAs is the source pixel J in the h'c engine, instead of the constant 1 in the hc engine. In addition, all data bit widths in the h'c engine are 8 bits wider than those in the hc engine.

According to equation (4.2), the integral value, J or 1, should be added into the bin corresponding to the guided pixel, while all other bins keep their original values. In the SBA, a selector before the adder picks out the corresponding bin, and a selector array after the adder writes the result back to that bin. All the selectors are controlled by the value of the guided pixel.

Fig. 6.5. Selected-bin adder in the histogram calculation engines

Fig. 6.6. Architectures of the histogram calculation engines: (a) h'c and (b) hc
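A behavioral sketch of a single SBA update is given below; the bin mapping that divides the 8-bit guided intensity by 256/Nb is an assumption made for illustration.

```python
NB = 64  # number of histogram bins

def selected_bin_add(bins, guided_pixel, integral_value):
    """Selected-bin adder: add the integral value (source pixel J for h'c, or the
    constant 1 for hc) only into the bin selected by the guided pixel; all other
    bins keep their original values."""
    out = list(bins)
    selected = guided_pixel // (256 // NB)   # selector controlled by the guided pixel
    out[selected] += integral_value          # the adder updates the selected bin only
    return out
```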


Fig. 6.7. The delay-buffer method: (a) S'(t) and S(t) at time t are delayed to become (b) D'(t+1) and D(t+1), respectively.

In the above architectures, each engine needs to access five IHs, namely IH_O'S', IH_O'D', IH_OS, IH_OD, and IH_OR, from the on-chip memory in one cycle. To relieve this bandwidth problem, we propose the delay-buffer method, which follows from the data dependency of the associated IHs in two successive cycles. Assume that the pixels S, S', D, and D' shown in Fig. 5.5 (d) are located at (x,y), (x,y-1), (x-1,y), and (x-1,y-1) in cycle t, respectively. As shown in Fig. 6.7 (a), their IHs are

$$IH_{OS}(t) = IH(x,y), \quad IH_{O'S'}(t) = IH(x,y-1), \quad IH_{OD}(t) = IH(x-1,y), \quad IH_{O'D'}(t) = IH(x-1,y-1).$$

For the next cycle t+1 in Fig. 6.7 (b), their x-coordinates are increased by 1, so that

$$IH_{OD}(t+1) = IH(x,y) = IH_{OS}(t), \quad IH_{O'D'}(t+1) = IH(x,y-1) = IH_{O'S'}(t).$$

Thus IH_OD(t+1) and IH_O'D'(t+1) can be obtained by delaying IH_OS(t) and IH_O'S'(t) for one cycle, respectively. Therefore, we can use two delay buffers to avoid accessing IH_O'D' and IH_OD from the on-chip memory, reducing the bandwidth from five IHs to three IHs.
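The following sketch shows the effect of the two delay buffers on one row of the integration process (a behavioral model only; the IH names follow the text, everything else is illustrative):

```python
def integrate_row(pixel_histograms, ih_prev_row):
    """Integrate one row of single-pixel histograms (SBA outputs) into integral
    histograms, using two delay buffers instead of two extra memory reads.
    pixel_histograms[x] : histogram contribution of pixel S at column x
    ih_prev_row[x]      : IH_O'S' at column x (previous row), read from on-chip memory
    """
    nb = len(ih_prev_row[0])
    ih_od = [0] * nb      # delay buffer: IH_OS of the previous cycle, i.e., IH_OD now
    ih_o_dp = [0] * nb    # delay buffer: IH_O'S' of the previous cycle, i.e., IH_O'D' now
    ih_row = []
    for x, h_s in enumerate(pixel_histograms):
        ih_o_sp = ih_prev_row[x]   # IH_O'S': the only per-cycle memory read needed here
        # Integration recurrence over all bins (equation (6.5) below):
        ih_os = [h_s[b] + ih_od[b] + ih_o_sp[b] - ih_o_dp[b] for b in range(nb)]
        ih_row.append(ih_os)
        ih_od, ih_o_dp = ih_os, ih_o_sp   # update the two delay buffers for the next cycle
    return ih_row
```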

The on-chip memory is divided into two banks because the engine makes two read demands per cycle: one for IH_O'S' and the other for IH_OR. Fig. 6.8 marks the even bank and the odd bank of the memory in white and dark, respectively. It shows that choosing the stripe width ws to be an even number makes the two read demands fall in different banks.


Fig. 6.8. On-chip memory with even bank and odd bank

Fig. 6.9. Schedule phases (phase I and phase II) of the on-chip memory

The detailed schedule runs in two alternating phases. With these phases, the even bank and the odd bank of the on-chip memory are used alternately for reading and writing, as shown in Fig. 6.9. In phase I, IH_O'S' and IH_OR are read from the even bank and the odd bank, respectively; in the meanwhile, IH_OD is written into the odd bank. Then, in phase II, IH_OD is written into the other (even) bank. As the arrow shows, the written IH_OD replaces the oldest integral histogram (the IH_O'S' of the prior phase), since that data will not be used anymore. In the meanwhile, IH_O'S' and IH_OR are read from the odd bank and the even bank, respectively. On the whole, the two phases exchange iteratively throughout the engine's operation.

In the following paragraphs, we explain the computation of the two histogram calculation engines. Their computation flows are almost the same; therefore, we show the details only for the h'c engine.

The two SBAs in Fig. 6.6 (a) perform the selected-bin updates at the check points marked in the figure: each adds the integral value into the bin selected by the guided pixel while passing all other bins through. The integration-process result IH_OS is then calculated by

$$IH_{OS} = h_S + IH_{OD} + IH_{O'S'} - IH_{O'D'}, \tag{6.5}$$

which is the same as (5.5), where h_S denotes the single-pixel histogram contribution produced by the SBA. Note especially that the addition and subtraction in (6.5) represent additions and subtractions over all bins. With the R-parallelism method, they are implemented by an array of adders whose length equals the number of bins Nb. Finally, also using an adder array, the engine performs the extraction process (with the notation of Fig. 5.5), which combines the integral histograms at the window corners, among them IH_OR, to calculate the histogram of the window h'c.
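A sketch of how the R-parallelism method maps (6.5) onto bin-wide operator arrays is shown below (one adder or subtractor per bin; the function names are illustrative):

```python
NB = 64  # one adder (or subtractor) per bin in the hardware operator array

def bin_add(a, b):
    """Element-wise addition over all Nb bins, i.e., an array of Nb adders."""
    return [a[i] + b[i] for i in range(NB)]

def bin_sub(a, b):
    """Element-wise subtraction over all Nb bins, i.e., an array of Nb subtractors."""
    return [a[i] - b[i] for i in range(NB)]

# Integration (6.5): IH_OS = h_S + IH_OD + IH_O'S' - IH_O'D', one operator array per term.
def integrate(h_s, ih_od, ih_o_sp, ih_o_dp):
    return bin_sub(bin_add(h_s, bin_add(ih_od, ih_o_sp)), ih_o_dp)

# The extraction is built from the same bin-wide adder/subtractor arrays, combining the
# integral histograms at the window corners (among them IH_OR) into the window histogram.
```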


6.5.2. Convolution Engine

Fig. 6.10. Proposed architecture of (a) the convolution engine and (b) the table selection modules

Fig. 6.11. Construction of constant weight table


The convolution engine uses the histograms hc and h'c to compute the pixel result through the kernel calculation and convolution processes in TABLE. 6-1. Its architecture is shown in Fig. 6.10 (a). With the proposed R-parallelism method, the convolution process achieves a throughput of 1 pixel per cycle. Higher throughput can be attained by pipelining on the cut-lines available in the figure, which allows a higher working clock.

The R-parallelism method brings high throughput but suffers from the large size and large number of range tables. With a 256-level range domain R, every target pixel intensity Ic would need its own 256-entry range table, so for 256 intensity levels the total number of table entries would be 256x256. To reduce the range table, we take advantage of the symmetry and truncation properties of the Gaussian function to decrease its size from 256 to 32. Fig. 6.11 shows that the curve of the Gaussian function can be truncated according to the required number of digits; for example, values smaller than 2^-8 can be truncated to keep 8 fractional bits. Furthermore, by the symmetry of the Gaussian function, the negative and positive sides are folded together. Finally, a constant weight table is sampled from the folded curve.

Nevertheless, the table size determines the quality, so it should be adjusted to meet the quality demand. In the proposed architecture, we use 32 entries as an example because a table of this size provides sufficient digit precision for usual BF processing (σr < 32).
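A sketch of how such a folded and truncated table could be generated offline is given below; the fixed-point scaling with 8 fractional bits is an assumption for illustration, and the table is assumed to be indexed by the absolute intensity difference.

```python
import math

def build_weight_table(sigma_r, size=32, frac_bits=8):
    """Folded, truncated Gaussian range-weight table (illustrative sketch)."""
    table = []
    for d in range(size):                        # d = |I(q) - Ic|, folded positive side
        w = math.exp(-(d * d) / (2.0 * sigma_r * sigma_r))
        if w < 2 ** -frac_bits:                  # truncation below the kept precision
            w = 0.0
        table.append(round(w * (1 << frac_bits)))  # fixed point, 8 fractional bits
    return table

# Example: build_weight_table(4.0) gives the 32-entry table for sigma_r = 4.
```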

In addition, to avoid a large number of range tables, we share a single table through the table selection module shown in Fig. 6.10 (b), which reduces the number of tables to one.

Each table selector chooses a weight from the table for its corresponding bin. For example, if Ic is 2, selector TS0 selects g(2) for the first bin (representing intensity 0), selector TS1 also selects g(2) for the second bin (representing intensity 4), and so on. Any bin whose intensity differs from Ic by more than the table range (here, any bin representing an intensity greater than 34) is given a weight of 0. Then the 64 selected weights, together with hc and h'c, are sent into the multiplier array and adder trees to compute equation (4.1).
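A behavioral sketch of this convolution process for one target pixel is given below, assuming that equation (4.1) is the usual normalized JBF output (the weighted sum over h'c divided by the weighted sum over hc) and that the bins are 4 intensity levels apart:

```python
def convolve_pixel(h_prime_c, h_c, weight_table, i_c, bin_step=4):
    """Kernel calculation and convolution for one target pixel (illustrative sketch)."""
    num = 0
    den = 0
    for b in range(len(h_c)):                    # 64 table selectors, one per bin
        diff = abs(b * bin_step - i_c)           # e.g. Ic = 2: bins 0 and 4 both pick g(2)
        w = weight_table[diff] if diff < len(weight_table) else 0  # out-of-range bins get 0
        num += w * h_prime_c[b]                  # multiplier array + adder tree (numerator)
        den += w * h_c[b]                        # multiplier array + adder tree (denominator)
    return num / den if den else 0
```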

6.5.3. Parameters versus hardware cost

TABLE. 6-2 Parameters and their associated engine components

Parameter           Histogram Calculation Engine                 Convolution Engine                           Selected Value
Window width |S|    On-chip memory size, signal bit width        Signal bit width                             31
Range kernel σr     -                                            Constant weight table size                   <32
Stripe width ws     On-chip memory size                          -                                            60
Bin number Nb       On-chip memory size, operator array length   Operator array length (adder/multiplier)     64 (σr = 4)

There are four main parameters that influence the hardware cost of the proposed histogram calculation engine and convolution engine: the window width |S|, the range kernel parameter σr, the stripe width ws, and the bin number Nb. The engine components associated with these parameters are listed in TABLE. 6-2. For example, |S|, ws, and Nb are associated with the on-chip memory size of the histogram calculation engine; this follows directly from equation (5.7), in which the memory cost of the integral histogram is determined by these three parameters.
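As a rough consistency check only (equation (5.7) is defined in Chapter 5 and not reproduced here, and the 32 bits per bin assumed for the stored hc and h'c integral data is our illustration, not a figure from the thesis), the 23-KByte on-chip memory quoted later is consistent with a cost proportional to (ws + |S| - 1) x Nb:

```python
ws, S, Nb = 60, 31, 64
bits_per_bin = 32                          # assumption for illustration only
bits = (ws + S - 1) * Nb * bits_per_bin    # 90 columns x 64 bins x 32 bits
print(bits // 8)                           # 23,040 bytes, i.e., about 23 KBytes
```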

According to TABLE. 6-2, the function-block layout of the core architecture does not have to be redesigned for different parameter selections, because these parameters do not affect its operation flow. (Note in particular that the operation flow is invariant even to the window size, since the processes of the integral histogram algorithm are independent of the window selection.) Instead, these parameters affect the size or the operator count of their corresponding engine components. Therefore, if an application has varying parameter demands, the size and the operator count of the engine components equipped in its hardware design must fulfill the most critical demand. For example, we select 31 as the window size for the proposed architecture, since it is larger than the selections of most acceleration algorithms and applications; this ensures that the architecture is suitable for most applications.

6.5.4. Summary of design components

Overall, the histogram calculation engines and the convolution engine can be serially connected to achieve a throughput of 1 pixel per cycle. Their function-block layouts and operation flows are invariant to the parameter selection (even to the window size). For higher speed demands, more engines can be used to process multiple cascaded pixels simultaneously, and the proposed memory reduction methods can be directly extended to support such multi-pixel processing. In addition, note that for the simpler BF, the h'c histogram calculation engine and its on-chip memory in the core module, as well as the two input FIFOs in the interface module, can be omitted.


6.6. Memory Cost Analysis

Fig. 6.12. Analysis of hardware performance and memory reduction: (a)-(c) hardware performance per frame with different ws (annotated at ws = 60: about 3.11 M cycles, 132.7 Mbits of off-chip bandwidth, and 23.04 KBytes of memory); (d) memory reduction with the proposed methods for ws of 60 (M=1080, N=1920, Nb=64, |S|=31).

In this section, we analyze the parameter selection for the proposed memory reduction methods and show the overall memory reduction achieved by the three methods combined.

In the combined memory cost (5.7), there are three parameters: the space kernel (window) width |S|, the number of bins Nb, and the stripe width ws. The former two are related to application quality, and the last one is related to the target performance. Referring to the quality analysis in [34], we select 31 for |S| and 64 for Nb as an example, and illustrate below how to determine ws by considering the hardware performance.

Fig. 6.12 (a)-(c) estimates the hardware performance of the JBF with different ws for the HD1080p resolution. The memory cost is computed with (5.7) and plotted in Fig. 6.12 (a). The off-chip bandwidth and the computation time are calculated by the following equations and plotted in Fig. 6.12 (b) and (c), respectively,

$$\mathrm{BW_{pixel}} = M\,(w_s + |S| - 1)\cdot\frac{N}{w_s}\cdot 4 + M\,w_s\cdot\frac{N}{w_s}\cdot 2 \quad\text{pixels} \tag{6.7}$$

and

$$\mathrm{Cycles} = M\,(w_s + |S| - 1)\cdot\frac{N}{w_s}\cdot 1 \quad\text{cycles} \tag{6.8}$$

where M(ws+|S|-1) is the stripe area including the extended regions, and N/ws is the number of stripes in a frame. In the bandwidth equation, the term with the factor of 4 pixels is required by the integration process, and the term with the factor of 2 pixels is required by the other processes. Since the integration process must additionally operate on the extended integral regions as in Fig. 5.2, its bandwidth is larger than that of the other processes. For the computation time, the proposed architecture takes 1 cycle to produce one pixel's integration result.
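For reference, evaluating (6.7) and (6.8) at ws = 60 reproduces the per-frame values annotated in Fig. 6.12 (8 bits per pixel are assumed when converting the pixel count to bits):

```python
M, N, ws, S = 1080, 1920, 60, 31
stripes = N / ws                             # 32 stripes per frame
stripe_area_ext = M * (ws + S - 1)           # stripe area with the extended regions
bw_pixels = stripes * (stripe_area_ext * 4 + M * ws * 2)   # (6.7), in pixels
cycles = stripes * stripe_area_ext * 1                      # (6.8), in cycles
print(bw_pixels * 8 / 1e6)   # about 132.7 Mbits of off-chip traffic per frame
print(cycles / 1e6)          # about 3.11 M cycles per frame
```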

The selection of ws is mainly determined by the target frame rate. If the target is 30 frames per second at a working clock of 100 MHz, the budget is about 3.3 M computation cycles per frame; therefore, we can select 60 for ws, as in the example used throughout this chapter (see TABLE. 6-2). With this choice, the off-chip bandwidth amounts to 62.2% of the available 64-bit bus bandwidth, and the memory cost is reduced to 23 KBytes, which is 0.003% of the original cost, as shown in Fig. 6.12 (d).
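The reasoning behind this choice can be checked with a few lines (the 64-bit bus width and the 100 MHz clock are from the text; 8 bits per pixel is assumed as above):

```python
clock_hz, fps = 100e6, 30
cycle_budget = clock_hz / fps             # about 3.33 M cycles available per frame
frame_bits = 132.7e6                      # off-chip traffic per frame, from (6.7)
bus_bits_per_s = 64 * clock_hz            # 64-bit bus at 100 MHz: 6,400 Mbit/s
print(cycle_budget / 1e6)                 # 3.33 M-cycle budget vs. about 3.3 M needed
print(frame_bits * fps / bus_bits_per_s)  # about 0.62, i.e., the quoted 62.2% utilization
```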

6.7. Implementation Result

With the parameters selected above, the proposed JBF architecture has been implemented in Verilog and synthesized with a 90-nm CMOS technology process. TABLE. 6-3 lists the implementation results. The hardware design spends fewer than 300 K equivalent gates and 23 KBytes of on-chip memory to achieve HD1080p at 30 frames/sec with a 100 MHz clock. Moreover, by pipelining on the available cut-lines in the convolution engine, it can run at 200 MHz and further achieve a throughput of 124 Mpixels per second, i.e., HD1080p at 60 frames per second.

TABLE. 6-3 Example implementation result of the proposed architecture

Technology                               UMC 90nm
Image Size MxN                           1920x1080
Number of Bins Nb                        64
Window Size |S|x|S|                      31x31
Stripe Width ws                          60
Clock Rate (Hz)                          100M        200M
Frame Rate (Frame/Sec.)                  30          60
Logic Cost Excluding Memories (Equivalent Gate Count)
  Interface                              9,578       9,917
  Histogram Cal.                         97,766      148,649
  Convolution                            168,333     197,351
  Total                                  276,178     355,917
On-chip Memory (Byte)                    23K         23K

TABLE. 6-4 compares the complexity, memory requirement, and bandwidths of the proposed methods against the original integral histogram at different resolutions. With the proposed memory reduction and architecture design techniques, the complexity can be reduced to 0.15%, and the memory requirement can be reduced to 0.003%-0.02%. In addition, the bandwidth for the IHs (i.e., the on-chip bandwidth) can be reduced to 32%-36%, while the bandwidth for pixels (i.e., the off-chip bandwidth) increases to 20.3-132.7 Mbits per frame.

(That is, the bandwidth per second is about 1,200-8,000 Mbit at 60 frames per second.) Nevertheless, this off-chip bandwidth is affordable for the 64-bit bus operating at 200 MHz, whose maximum bandwidth is 12,800 Mbit per second. Note that the stripe width ws was selected specifically for the HD1080p resolution; it can be re-selected by means of the analysis in Section 6.6 to obtain better performance for another resolution.


TABLE. 6-5 compares the proposed hardware design with previous implementations. Note that, to the best of the author's knowledge, this work is the first VLSI implementation, so only GPU and CPU approaches are listed for reference. Although its throughput is lower than that of the Bilateral Grid, the proposed design still achieves the best overall performance because of its significantly reduced memory cost. Compared with the other designs, the proposed architecture utilizes the hardware cost efficiently to achieve real-time speed with low memory cost.

TABLE. 6-4 Comparison of hardware cost per frame (the proposed architecture design techniques; percentages are relative to the original integral histogram)

Resol.     Complexity    Memory (KByte)   IH (on-chip) bandwidth   Pixel (off-chip) bandwidth (Mbit)
VGA        5.1 (0.15%)   23 (0.020%)      5,191 (36%)              20.3 (206%)
HD720p     1.5 (0.15%)   23 (0.007%)      15,571 (34%)             60.8 (206%)
HD1080p    3.3 (0.15%)   23 (0.003%)      33,974 (32%)             132.7 (200%)

Number of bins Nb=64, window width |S|=31, stripe width ws=60; VGA=640x480, HD720p=1280x720, HD1080p=1920x1080

TABLE. 6-5 Comparison of different implementations (support-pixel-first and target-pixel-first approaches), including Durand and Dorsey [13], subsampling, bilateral grid, piecewise-linear, and Gaussian KD-tree.

7. Conclusion

The main contribution of this thesis is an efficient hardware architecture with three memory reduction methods for real-time integral-histogram-based JBF. The three memory reduction methods combined reduce the memory cost to 0.003% of that of the original integral-histogram-based JBF. The efficient hardware architecture processes a large number of histogram bins in parallel to achieve a high throughput of 1 pixel per cycle. The ASIC implementation of the architecture achieves 124 Mpixels (60 frames) per second for HD1080p images at a 200 MHz clock rate. The chip consumes in total 355 K gate counts and 23 KBytes of internal memory. The off-chip bandwidth requirement is 132.7 Mbits per frame, which is about 60% of the total bus bandwidth at the 200 MHz clock rate. For higher throughput, the architecture and the memory reduction methods can be directly extended to support the processing of multiple cascaded pixels.

Future Work

In this thesis, we have proposed an efficient architecture for IH-based JBF, and its design concept is also suitable for any integral-image-based application, although it is limited to those using the box spatial kernel. Nevertheless, Mohamed et al. [43] have shown that a more complicated kernel can be approximated by a linear combination of many basic box kernels, which extends the integral image approach to more complex applications. For such applications, multiple parallel hardware cores for the basic box kernels must be put together, and thus the overall data-transfer and communication interface, as well as the internal memory and bandwidth requirements, must be re-estimated carefully for the best performance.


On the other hand, the proposed architecture is suited to gray-level image processing. For extended use with multi-color channels, extra software or hardware has to be added to process each color channel.
