

4.5 PA-IBS Hardware Architecture

The PA-IBS architecture is a hardware-efficient, power-adaptive design based on both the IBS algorithm in Section 4.3 and the design approaches in Section 4.4. The design contains a low-complexity binary matching architecture that reduces peak power consumption, and a frequency scaling technique for power adaptation that prevents hardware idling.

4.5.1 System Architecture

The system architecture of PA-IBS, as shown in Fig. 4.5, contains several on-chip memories for current and reference search data, an 8×1 line search engine to realize the region search, pipelined buffers to store the accumulated SODs, and several control units for data and search flow control. The current and reference image data for ME are stored in two sets of local buffers, LM CUR and LM REF, respectively. LM CUR contains four banks (C0-C3) to store the current block data, where each bank is a 16×64 two-port register file.

LM REF contains nine banks (S0-S8) to store the reference data, where each bank is a 64×64 two-port register file. The current and reference block data are received via the memory interface (MEM IF). For each access of search data for the current 8×8 region search, LM CUR outputs 16×16 bits and LM REF outputs 24×24 bits into two register arrays, REG CUR and REG REF. The region search is realized by performing eight 8×1 line searches. The 8×1 line search engine, designed for low-complexity binary matching, takes 256 bits from REG CUR and 16×24 bits from REG REF for the parallel 8×1 binary search in each cycle. For IBS with multiple iterations, the hardware accumulates the SODs from each iteration and stores them in the pipelined buffers (PB0-PB7). The line search engine needs eight cycles in total to complete one 8×8 region search for each iteration, and (8 × φ) cycles for φ iterations. The pipelined buffers are designed to store the accumulated SODs. Eight lines of 8×1 line search results need to be buffered in one 8×8 region search, so eight pipelined buffers are used.
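The memory organization and cycle budget can be summarized with a small behavioral model. The sketch below, in C, mirrors the bank counts and bit widths given above; the packing of the register-file words into C arrays and the type and function names are assumptions for illustration only.

```c
#include <stdint.h>

/* Behavioral model of the PA-IBS local memories (assumed data packing).
 * LM CUR: four banks (C0-C3), each a 16x64-bit two-port register file.
 * LM REF: nine banks (S0-S8), each a 64x64-bit two-port register file. */
typedef struct {
    uint64_t lm_cur[4][16];   /* 4 banks x 16 words x 64 bits */
    uint64_t lm_ref[9][64];   /* 9 banks x 64 words x 64 bits */
} pa_ibs_local_mem_t;

/* One iteration of an 8x8 region search takes eight 8x1 line searches,
 * i.e. eight cycles, so phi iterations take 8 * phi cycles. */
static inline unsigned region_search_cycles(unsigned phi)
{
    return 8u * phi;
}
```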

Once the search engine completes one region search, the search locations are output from the vector generator (VG) to the decision engine for the final motion vector selection, using the information from both the search locations and their associated SODs. The controller (CTRL) is the control center of the whole motion estimation design. It makes sure the decision engine meets the input timing of the SOD results from the pipelined buffers, controls the address generator (AG) to output the addresses for memory access, clears the accumulated SODs in the pipelined buffers at the beginning of each region search, sends the target iteration parameter (φ) to the clock generator (CG) for the appropriate working frequency, and controls the timing to enable the decision engine.
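As a rough illustration of the power adaptation, the sketch below assumes the clock generator scales the working frequency linearly with φ so that the (8 × φ)-cycle region search finishes exactly within a fixed block period; the linear scaling rule and the block_rate_hz parameter are assumptions for illustration, not details taken from the text.

```c
/* Hedged sketch of frequency scaling for power adaptation (assumption:
 * linear scaling with phi at a fixed block throughput).  With 8 * phi
 * cycles per 8x8 region search and a fixed number of region searches
 * per second, the working frequency can be lowered for small phi so
 * that the search engine never idles between blocks. */
static inline double cg_working_frequency_hz(unsigned phi, double block_rate_hz)
{
    const double cycles_per_block = 8.0 * (double)phi;
    return cycles_per_block * block_rate_hz;
}
```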

4.5.2 8×1 Line Search

The 8×1 line search engine realizes the 8×8 region search line by line. To avoid the huge amount of data access and hardware required for fully parallel processing, the 8×8 region search is divided into eight 8×1 line searches and is processed line by line for smooth data access. With this architecture, we only need to design for the smallest search unit of one 8×1 line search. Parallel processing of the eight search locations in a line takes only one cycle, since the pattern matching criterion uses simple XOR operations for binary images. It therefore takes eight cycles to complete one 8×8 region search.
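For binary images, the matching criterion at a single search location reduces to an XOR followed by a bit count, which is why eight locations fit in one cycle. A minimal software sketch of that criterion is shown below; the packing of a 16×16 binary block into four 64-bit words and the use of a compiler popcount builtin are assumptions for illustration.

```c
#include <stdint.h>

/* Minimal sketch of the binary matching criterion (software model, not RTL).
 * With 1-bit pixels, the sum of differences over a 16x16 block is simply
 * the number of differing bits: popcount(cur XOR ref).  Packing the
 * 256-bit block into four 64-bit words is an assumed layout. */
static unsigned sod_16x16_binary(const uint64_t cur[4], const uint64_t ref[4])
{
    unsigned sod = 0;
    for (int w = 0; w < 4; w++)
        sod += (unsigned)__builtin_popcountll(cur[w] ^ ref[w]);
    return sod;
}
```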

The line search engine is designed to process in parallel the eight search locations of the 8×1 line search. As shown in Fig. 4.6, the search engine takes 16×16 bits from the register array REG CUR and 16×24 bits from REG REF for each 8×1 search. For each location in the 8×1 line search, the block matching criterion is a 256-bit SOD operation using the same 16×16 bits from REG CUR and a different 16×16-bit window of the 16×24-bit search window of the 8×1 line search. Each 256-bit SOD operation is further partitioned into sixteen 4×4 SOD computations, since the smallest search block size in H.264/AVC is 4×4. The accumulated 4×4 SODs are then stored in the pipelined buffers before being output to the decision engine. The SOD results for block sizes from 4×8 to 16×16 are computed by summing the sixteen accumulated 4×4 SODs in the decision engine to select the final motion vectors.
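The 4×4 partitioning can be modeled in software as below. The row-per-16-bit-word packing, the helper names, and the choice of a 16×8 partition as the combination example are assumptions for illustration; the point is that the sixteen 4×4 SODs are enough for the decision engine to rebuild any H.264/AVC block size.

```c
#include <stdint.h>

/* Software sketch of the sixteen 4x4 SOD computations for one search
 * location (assumed layout: each uint16_t holds one 16-pixel binary row,
 * bit k = column k of that row). */
static unsigned sod_4x4(const uint16_t cur[16], const uint16_t ref[16],
                        int sub_row, int sub_col)
{
    unsigned sod = 0;
    for (int r = 0; r < 4; r++) {
        unsigned c = (cur[4 * sub_row + r] >> (4 * sub_col)) & 0xFu;
        unsigned f = (ref[4 * sub_row + r] >> (4 * sub_col)) & 0xFu;
        sod += (unsigned)__builtin_popcount(c ^ f);
    }
    return sod;
}

/* All sixteen 4x4 SODs of the 16x16 block; their total equals the
 * 256-bit SOD of the whole block. */
static void sod_16x16_as_4x4(const uint16_t cur[16], const uint16_t ref[16],
                             unsigned sod4x4[4][4])
{
    for (int sr = 0; sr < 4; sr++)
        for (int sc = 0; sc < 4; sc++)
            sod4x4[sr][sc] = sod_4x4(cur, ref, sr, sc);
}

/* Example of the decision-engine combination: the SOD of the upper 16x8
 * partition is the sum of its eight 4x4 SODs. */
static unsigned sod_16x8_top(unsigned sod4x4[4][4])
{
    unsigned sum = 0;
    for (int sr = 0; sr < 2; sr++)
        for (int sc = 0; sc < 4; sc++)
            sum += sod4x4[sr][sc];
    return sum;
}
```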

4.5.3 Pipelined Buffers

The pipelined buffers are designed to store the accumulated SODs over multiple iterations of IBS. We design eight sets of 1024-bit pipelined buffers to store the accumulated SODs from each 8×1 line search. For multiple iterations of searches, the pipelined buffers hold the eight lines of SODs and accumulate the SODs from each iteration. After all iterations of searches are done, the pipelined buffers output the final SOD sums to the decision engine. Fig. 4.7 shows the data output timing from the 8×1 line search engine to the pipelined buffers for φ iterations of binary searches. If φ equals one, the final SODs are generated from cycle 9 to cycle 16. If there are multiple iterations, the SODs for row 0 are accumulated with those of row 0 of the next iteration at cycle 9. The final SODs for the eight lines of searches are generated from cycle (8 · φ + 1) to cycle (8 · φ + 8).
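A behavioral sketch of this accumulation is given below. It is not RTL; the 8-bit-per-SOD packing (8 locations × 16 SODs × 8 bits = 1024 bits per buffer) and the helper names are assumptions made only to illustrate how the buffers add the SODs of successive iterations.

```c
#include <string.h>

/* Behavioral sketch of the pipelined-buffer accumulation (not RTL).
 * One buffer per line; each holds the 4x4 SODs of its eight search
 * locations.  The assumed packing (8 locations x 16 SODs x 8 bits)
 * fills one 1024-bit buffer. */
#define LINES        8
#define LOCATIONS    8
#define SODS_PER_LOC 16

typedef struct {
    unsigned acc[LOCATIONS][SODS_PER_LOC];   /* one 1024-bit buffer */
} line_buffer_t;

typedef struct {
    line_buffer_t pb[LINES];                 /* PB0 - PB7 */
} pipelined_buffers_t;

/* Cleared by CTRL at the start of every region search. */
static void pb_clear(pipelined_buffers_t *p)
{
    memset(p, 0, sizeof *p);
}

/* The SODs of line r from iteration i (i = 0 .. phi-1) arrive at cycle
 * 8*i + r + 1 and are added to the sums of the earlier iterations; the
 * final sums leave for the decision engine at cycles 8*phi + 1 through
 * 8*phi + 8, one buffer per cycle. */
static void pb_accumulate(pipelined_buffers_t *p, int line,
                          const unsigned sod[LOCATIONS][SODS_PER_LOC])
{
    for (int l = 0; l < LOCATIONS; l++)
        for (int k = 0; k < SODS_PER_LOC; k++)
            p->pb[line].acc[l][k] += sod[l][k];
}
```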


Figure 4.5: Block diagram of PA-IBS architecture.


Figure 4.6: Architecture of line based search engine.

Figure 4.7: Output data timing from line based search engine to the pipelined buffers for φ iterations of binary searches.


4.6 Experimental Results and Analysis

4.6.1 Evaluation of Algorithmic Performance

The performance evaluation of PA-IBS without CAM demonstrates better PSNR performance compared to prior power adaptive algorithms. The experiment on PA-IBS with CAM demonstrates a further complexity reduction in software, or power reduction in hardware, at similar video quality and coding bit rate.
