Chapter 5 Bandwidth Reduction Techniques in Computation Cores
5.4. MCADSW
5.4.4. Bandwidth Reduction Techniques for MCADSW Architecture
Reducing the bandwidth requirement is important because the available memory bandwidth is limited.
We propose two techniques to reduce it: partial column reuse (PCR) and access reduction with expanded window (AREW).
A. Partial Column Reuse (PCR)
PCR reuses the data in each column to reduce both the memory bandwidth and the computation requirements. A column is usually part of multiple horizontally overlapped windows, so the data of each column can be shared when computing the final results of these windows. The data may be original pixel data or intermediate results. By storing these data, the number of memory accesses and computations can be reduced.
As a result, each column is read and computed only once.
Fig. 45 illustrates how PCR is applied to the mini-census generation. The pixels in column x=n contribute to the generation of three mini-censuses. Fig. 46 illustrates how a vertical aggregated cost is shared by 18 horizontally overlapped aggregation windows. This reduces the read count of a pixel column from 18 to 1 for these 18 windows. Since PCR reuses computation as well, it may also be applied to other types of implementations, such as processor-based, DSP-based, GPU-based, and FPGA-based ones, to reduce the computation requirement.
[Figure: the column at x=n is covered by the mini-census templates for the pixels at x=n-2, x=n, and x=n+2.]
Fig. 45. Partial column reuse in mini-census transform
[Figure: the 31 census costs of each column are vertically aggregated; the resulting vertical aggregated cost of a column (columns i, j, k) is shared by the horizontal aggregation of windows 0 through 17, each of which produces its own horizontal aggregated cost.]
Fig. 46. Partial column reuse in cost aggregation
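The column sharing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware design: the function names are hypothetical, plain unweighted sums stand in for the aggregation of the actual architecture (which also involves weights), and the strip size of 31 rows by 48 columns is chosen so that a 31-column window yields 18 horizontally overlapped windows, as in Fig. 46.

```python
def aggregate_naive(cost, win_w):
    """Sum every win_w-column window independently, re-reading each column."""
    rows, cols = len(cost), len(cost[0])
    sums, column_reads = [], 0
    for x0 in range(cols - win_w + 1):
        total = 0
        for x in range(x0, x0 + win_w):
            # The column is read and vertically aggregated again per window.
            total += sum(cost[y][x] for y in range(rows))
            column_reads += 1
        sums.append(total)
    return sums, column_reads


def aggregate_pcr(cost, win_w):
    """Vertically aggregate each column once and share it across windows."""
    rows, cols = len(cost), len(cost[0])
    # One read and one vertical aggregation per column (hypothetical buffer).
    vertical = [sum(cost[y][x] for y in range(rows)) for x in range(cols)]
    column_reads = cols
    # Horizontal aggregation only touches the stored vertical results.
    sums = [sum(vertical[x0:x0 + win_w]) for x0 in range(cols - win_w + 1)]
    return sums, column_reads


if __name__ == "__main__":
    # Synthetic cost strip: 31 rows, 48 columns (assumed sizes, see lead-in).
    cost = [[(3 * y + 5 * x) % 7 for x in range(48)] for y in range(31)]
    naive_sums, naive_reads = aggregate_naive(cost, 31)
    pcr_sums, pcr_reads = aggregate_pcr(cost, 31)
    print(len(pcr_sums), naive_reads, pcr_reads, naive_sums == pcr_sums)
```

Both functions return the same window sums; only the number of column reads differs, which is the effect PCR exploits.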
B. Access Reduction with Expanded Window (AREW)
AREW reduces the bandwidth requirement by deliberately expanding the size of the read window. The expanded window reduces the read count of a pixel by reducing the number of overlapping windows that contain this pixel. We explain this using the example of a vertically expanded window shown in Fig. 47. Note that horizontal overlap is ignored for the sake of clarity. In this example, the original window size is 5x5 pixels and the number of vertically expanded rows is 3.
Fig. 47 (a) illustrates how windows overlap vertically without expanded rows.
The first window is located at row n and column k. When the window changes its row position at the end of a horizontal scan, the new window overlaps vertically with the old one: the second window is located at row n+1 and column k. The position of the first window is shown by the dashed box, and the overlapped region is marked in a darker color. Since only the pixel data within the read window are buffered, for cost reasons, the vertically overlapped region must be re-accessed. Consequently, the access count of a pixel is determined by the number of overlapping windows that contain this pixel. The maximal access count of a pixel is five in this case, as shown in the figure.
Fig. 47 (b) illustrates the case with the expanded window. With the expanded rows, the vertically enlarged window jumps farther when a row change happens, so the second window is located at row n+4. The maximal access count of each pixel is only 2, which is much smaller than in the case without the expanded window. Horizontal expansion reduces the access count in the same way as vertical expansion. If the window could be enlarged to the size of the image, the read count of each pixel would be only one. However, expanding the window also requires larger internal storage and more hardware resources. Therefore, the numbers of expanded rows and columns should be carefully selected. In our case, the number of expanded rows and the number of expanded columns are both 17. AREW is applied to the mini-census generation and the weight generation.
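The access counts in this example can be checked with a small simulation. The sketch below is an illustration under assumptions, not the hardware control logic: the function name and the 40-row image height are hypothetical, and only the vertical direction is modeled. It assumes the read window is the original height plus the expanded rows and that it jumps down by one plus the number of expanded rows after each horizontal scan, matching the jump from row n to row n+4 for 3 expanded rows.

```python
def row_access_counts(num_rows, window_h, expanded_rows):
    """Count how often each pixel row is read during a full vertical scan.

    The read window covers window_h + expanded_rows rows and moves down by
    1 + expanded_rows rows after each horizontal scan (assumed model).
    """
    read_h = window_h + expanded_rows
    stride = 1 + expanded_rows
    counts = [0] * num_rows
    top = 0
    while top + read_h <= num_rows:
        for r in range(top, top + read_h):
            counts[r] += 1  # every row inside the read window is fetched
        top += stride
    return counts


if __name__ == "__main__":
    for expanded in (0, 3):
        counts = row_access_counts(40, 5, expanded)
        print(expanded, max(counts))
```

With a 5-row window, the maximal count is 5 without expanded rows and 2 with 3 expanded rows. Setting the window height to the full image height gives a count of 1 for every row, which is the limiting case mentioned in the text.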
[Figure: (a) read windows without expanded rows; (b) read windows with 3 expanded rows, located at rows n, n+4, and n+8 of column k (jump distance 1+3), so that a pixel row is read 2 times.]
Fig. 47. Example of access count reduction with expanded window, (a) without expanded rows, (b) with 3 expanded rows
The bandwidth requirement to the external frame memory can be estimated from the read count of each pixel. The read count of a pixel is determined by the number of times it is overlapped by mini-census transform windows and aggregation windows. In a direct implementation without any data reuse, a pixel is overlapped by 3 mini-census transform windows in the horizontal direction, 4 mini-census transform windows in the vertical direction, 31 aggregation windows in the horizontal direction, and 31 aggregation windows in the vertical direction. The read data width of a pixel is one byte in the mini-census transform and three bytes in the aggregation. As a result, neglecting the boundary cases, the total bandwidth requirement for a CIF-size base image at 30 FPS is about ((7 × 31 × 31) × 1 byte + 31 × 31 × 3 bytes) × (352 × 288) × 30 FPS = 27.22 GB/s. If we assume that the pixel data read for the weight generation already include the pixel data read for the mini-census transform, the total bandwidth can be reduced to (31 × 31 × 3 bytes) × (352 × 288) × 30 FPS = 8.17 GB/s. After applying the PCR and AREW bandwidth reduction techniques, the average read count of a pixel is reduced to 5.17. The bandwidth requirement is therefore reduced to 5.17 × 3 bytes × (352 × 288) × 30 FPS = 44.99 MB/s. The proposed bandwidth reduction techniques can also be applied to other aggregation-based stereo matching architectures to reduce their bandwidth requirements.
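The three bandwidth figures can be reproduced with the short calculation below. The variable and function names are hypothetical; the per-pixel byte counts are taken directly from the expressions in the text. One observation from the arithmetic, rather than a statement in the source: the quoted values match when GB and MB are read as 2^30 and 2^20 bytes, so the sketch assumes binary units.

```python
WIDTH, HEIGHT, FPS = 352, 288, 30           # CIF base image at 30 FPS
PIXELS_PER_SECOND = WIDTH * HEIGHT * FPS

GIB, MIB = 2**30, 2**20                     # assumed binary units


def bandwidth(bytes_per_pixel):
    """Bytes per second read from frame memory for a per-pixel byte count."""
    return bytes_per_pixel * PIXELS_PER_SECOND


# Direct implementation: mini-census reads (1 byte) plus aggregation reads
# (3 bytes), using the counts given in the text.
direct = bandwidth(7 * 31 * 31 * 1 + 31 * 31 * 3)
# Mini-census reads assumed to be covered by the weight-generation reads.
shared = bandwidth(31 * 31 * 3)
# After PCR and AREW: average read count of 5.17 at 3 bytes per read.
reduced = bandwidth(5.17 * 3)

if __name__ == "__main__":
    print(f"direct:  {direct / GIB:.2f} GB/s")
    print(f"shared:  {shared / GIB:.2f} GB/s")
    print(f"reduced: {reduced / MIB:.2f} MB/s")
```

Under these assumptions the script prints 27.22 GB/s, 8.17 GB/s, and 44.99 MB/s, in agreement with the values above.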