
C. Support of B-frame and 8x8 Block Search

3.4 Hardware Architecture

3.4.1 System Architecture

Fig. 3.7 shows the system architecture of the proposed design, which implements the BBME algorithm with parallel hardware support for B-frame search as described in Section III. For B-frame search, two sets of hardware enable the forward and backward searches in parallel. The 8-bit data of the current search block are passed to the macroblock pre-processing unit (MBPPU) to generate three levels of binary search blocks. The three levels of binary data are stored in the on-chip memories C1-C3. The binary data of the forward and backward reference frames are stored in two sets of on-chip memories, S01-S03 and S11-S13, respectively. The memory blocks C1, S01, and S11 are for LV1; C2, S02, and S12 are for LV2; and C3, S03, and S13 are for LV3.

For each level of block search, the address generator (AG) controls the access of C1-C3, S01-S03, and S11-S13 to provide the necessary current and reference block data to the shared SOD processing units for block matching (SOD1 for the forward search and SOD2 for the backward search).

The shared processing units first decide which level of binary data to use for calculation according to the control signal from the controller (CTRL). Then, they compute the SOD between the selected current and reference search blocks. The matching results are sent to the comparator for final motion vector selection, taking into account the motion vector cost from the motion vector generator (VG).
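The core matching step above can be sketched in software. This is a minimal sketch, assuming a 16x16 binary block is packed into a 256-bit integer (a hypothetical packing; the hardware uses a 256-bit XOR array), and it omits the motion vector cost term for brevity:

```python
# SOD of two binary blocks = popcount of their XOR, mirroring the
# XOR + adder-tree datapath of SOD1/SOD2 (packing is an assumption).

def sod_256(cur: int, ref: int) -> int:
    """Sum of differences between two 256-bit binary blocks."""
    return bin(cur ^ ref).count("1")

def best_match(cur: int, candidates: list) -> int:
    """Index of the candidate with minimum SOD, mimicking the
    comparator's final selection (MV cost omitted here)."""
    sods = [sod_256(cur, r) for r in candidates]
    return min(range(len(candidates)), key=sods.__getitem__)
```

In hardware this whole comparison completes in one cycle per candidate, since the XOR and the adder tree are fully parallel.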

Chapter 3: Bi-directional Binary Motion Estimation (BBME) 59

For forward-only P-frame search, the parallel architecture would leave half of the hardware idle. This issue is addressed with a parallel P-frame search scheme. In the parallel P-frame search mode, the forward search data from S01-S03 are mirrored to S11-S13. The original forward search path through the SOD1 module handles the odd search positions, while the backward search path through the SOD2 module handles the even positions. The AG/CTRL/VG/comparator modules send the appropriate addresses and control signals to each path. Thus, both paths stay busy and the execution cycles are halved.
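The odd/even partition of search positions can be sketched as follows (the position indexing is an assumption for illustration; the AG generates the actual addresses):

```python
# Split the candidate search positions between the two SOD paths the way
# the AG/CTRL drive them in parallel P-frame mode: odd-indexed positions
# to the SOD1 path, even-indexed positions to the SOD2 path.

def split_positions(positions):
    sod1 = [p for i, p in enumerate(positions) if i % 2 == 1]  # odd
    sod2 = [p for i, p in enumerate(positions) if i % 2 == 0]  # even
    return sod1, sod2

points = list(range(10))            # 10 hypothetical candidate positions
p1, p2 = split_positions(points)
assert sorted(p1 + p2) == points    # every position is still searched
assert max(len(p1), len(p2)) == len(points) // 2  # half the cycles per path
```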

Compared to the conventional serial architecture, which processes the forward and backward searches sequentially, this parallel architecture offers five major advantages:

• Less on-chip memory access: The parallel architecture reuses the current-block search data, removes redundant on-chip memory accesses, and thus saves power. This is particularly important for the pyramid search structure, since the current block data change at each level of the search.

• Higher overall hardware utilization: In addition to full hardware utilization for ME, all the other modules in the pipeline enjoy higher utilization. Typically, the ME module takes the most execution cycles compared to other modules such as the transform or motion compensation (MC), so it can become the critical path of the overall system. In a serial design, the execution cycles are doubled for B-frame search, which leads to more idle cycles for the other modules such as entropy coding or the transform. Thus, the parallel architecture not only halves the ME execution cycles but also reduces the idling of the other modules. Compared to the serial architecture, our parallel architecture decreases the execution cycles for both P-frame and B-frame search. As shown in Table 3.8, the P-frame and B-frame searches take 177 and 148 cycles, respectively.

• Lower working frequency: A lower working frequency is a key factor in low-power design. As in the previous item, B-frame ME search is typically the slowest module, in which case it dominates the choice of system frequency. The parallel B-frame architecture improves this worst-case scenario relative to the serial architecture and thus allows the lowest system frequency. Combined with the voltage scaling technique, the parallel architecture can achieve further power reduction.

• Less penalty in hardware cost: Parallelizing the binary search is inexpensive. If the system used full 8-bit pixels for matching, the parallel architecture would suffer a much larger hardware increase. Although the increase in percentage terms is the same, the binary search algorithm incurs a smaller absolute increase.

• Flexibility for joint optimization of B-frame search: Joint optimization of B-frame search is a widely used encoding technique that jointly considers cost and distortion based on the forward and backward search results to produce better motion vectors. A serial architecture must finish the forward and backward searches before the joint optimization can start. The parallel architecture saves both those cycles and the memory for storing the first-pass results.
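The cycle argument running through these advantages can be illustrated with a toy model (the per-direction cycle counts here are placeholders, not the figures reported in Table 3.8):

```python
# Toy cycle model for the serial vs. parallel comparison: a serial design
# runs the forward and backward searches back to back, while the parallel
# design runs them simultaneously on the two SOD paths.

def serial_cycles(fwd: int, bwd: int) -> int:
    return fwd + bwd            # forward search, then backward search

def parallel_cycles(fwd: int, bwd: int) -> int:
    return max(fwd, bwd)        # both directions at once

assert serial_cycles(100, 100) == 200
assert parallel_cycles(100, 100) == 100   # B-frame cycles halved
```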

Table 3.4 gives a brief comparison of the serial architecture and our parallel architecture. Although the on-chip memory is doubled, the local memory required by the 1-bit parallel architecture is still only one quarter of that of the 8-bit serial design. The peak memory bandwidth of the proposed parallel 1-bit architecture is (B1 + 2 × B2), in which B1 represents the current-frame on-chip memory bandwidth and 2 × B2 the bandwidth for the two parallel forward and backward reference frames. The execution cycles are algorithm dependent, but ME with 8-bit data may need more cycles to complete each search location. The reason is that the binary search can match one search location per cycle with its simple XOR operation, whereas most ME designs with 8-bit data cannot achieve single-cycle execution unless a two-dimensional systolic array is used. The binary search is therefore better suited to the parallel architecture due to its simplicity.

Table 3.4: Comparison of the serial and parallel architectures.

Architecture Serial Parallel

3.4.2 Macroblock Pre-processing Unit (MBPPU)

The three levels of binary pyramid data for search are generated by the MBPPU. The binarization processing elements (PEs) are the key modules that convert the 8-bit image data into binary format. Each binarization PE contains a 3 × 3 filtering operation and a comparator. To support the 3 × 3 filter design shown in eq. (3.1), three rows of line buffers are needed to output one line of binary data. Fig. 3.8 shows the architecture of the MBPPU. For each level of pre-processing, there are three rows of data buffers, several binarization (BIN) PEs, and row rotators. The three rows of data buffers store three rows of 8-bit data before the binarization process. The filter operation implements the filter HA. The number of BIN PEs is chosen according to the processing data rate at each level, and the processing rate depends on the width of each row buffer (18 for LV3, 10 for LV2, and 6 for LV1). Accordingly, the numbers of PEs for LV3 down to LV1 are 9, 5, and 2. The row rotator feeds the correct data to the binarization PEs.
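One binarization PE can be sketched as below. Since eq. (3.1) is defined elsewhere in the chapter, the sketch assumes the filter HA is a 3 × 3 average and that the comparator outputs 1 when the center pixel exceeds the local filter response; both are assumptions for illustration:

```python
# Sketch of one binarization (BIN) PE: a 3x3 filter over the three line
# buffers followed by a comparator. The averaging kernel and the
# center-vs-mean comparison are assumptions standing in for eq. (3.1).

def bin_pe(rows):
    """rows: three equal-length lists of 8-bit pixels (the line buffers).
    Returns one row of binary pixels for the interior columns."""
    w = len(rows[0])
    out = []
    for x in range(1, w - 1):
        window = [rows[r][x + dx] for r in range(3) for dx in (-1, 0, 1)]
        mean = sum(window) / 9.0        # 3x3 filter HA (assumed average)
        out.append(1 if rows[1][x] > mean else 0)
    return out
```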


3.4.3 Three Levels of Binary Search

The three levels of binary search are carried out in three steps: (1) the CTRL informs the AG of the current level index for search; (2) the AG sends addresses to the local memories; (3) the local memories output data to SOD1 and SOD2 for search. Fig. 3.9 shows the operational flow of the SOD1 and SOD2 processing units. Each processing unit is basically a 256-bit XOR operation followed by a 256-bit adder tree. To support the three search block sizes of LV1 to LV3, the 256-bit XOR operation is partitioned into sixteen 16-bit XOR operations that provide sixteen 4x4 SOD results S_i^{4x4} (i = 0 ∼ 15). The sixteen 4x4 SODs can then be accumulated into four 8x8 SODs S_i^{8x8} (i = 0 ∼ 3) or one 16x16 SOD S_0^{16x16}. To ensure that the input data and the output motion vectors are moved correctly, the CTRL sends a control signal to inform the processing unit which level is being processed.
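The hierarchical accumulation of partial SODs can be sketched as follows. The raster-order layout of the sixteen 4x4 blocks, with each 8x8 quadrant owning a 2x2 group of them, is an assumption about the datapath wiring:

```python
# Combine sixteen 4x4 SODs into four 8x8 SODs or one 16x16 SOD,
# mirroring the partitioned adder tree. Block indices are assumed to be
# in raster order over a 4x4 grid of 4x4 blocks.

def combine(s4):
    """s4: list of sixteen 4x4 SOD values."""
    # each 8x8 quadrant sums a 2x2 group of 4x4 blocks
    groups = [(0, 1, 4, 5), (2, 3, 6, 7), (8, 9, 12, 13), (10, 11, 14, 15)]
    s8 = [sum(s4[i] for i in g) for g in groups]
    s16 = sum(s8)                      # the single 16x16 SOD
    return s8, s16

s8, s16 = combine(list(range(16)))
assert s16 == sum(range(16))           # 16x16 SOD totals all partial SODs
assert s8[0] == 0 + 1 + 4 + 5          # top-left 8x8 quadrant
```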

For LV2, five search points need to be processed in parallel. However, the processing unit can handle only four 8x8 SOD operations at a time. Thus, the search point opposite to the motion vector direction is abandoned.
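The pruning from five candidates down to the four the hardware can evaluate might look like the sketch below. The cross-shaped candidate layout and the dominant-component rule for choosing the dropped point are assumptions, since the exact LV2 pattern is defined by the search algorithm earlier in the chapter:

```python
# Drop the LV2 search point opposite to the predicted motion vector so
# only four 8x8 SODs are needed. Candidate layout (centre + 4-neighbour
# cross) and the tie-breaking rule are assumptions for illustration.

def lv2_points(mv_dx, mv_dy):
    candidates = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    # step opposite to the dominant MV component (assumed rule)
    if abs(mv_dx) >= abs(mv_dy):
        drop = (-1, 0) if mv_dx >= 0 else (1, 0)
    else:
        drop = (0, -1) if mv_dy >= 0 else (0, 1)
    return [p for p in candidates if p != drop]
```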
