Motion Estimation Engine - Integer Pel ME (IME) Module

Chapter 4 Architecture of Bi-directional Binary with Sub-pel Motion Estimation

4.2 Integer Pel ME (IME) Module

4.2.3 Motion Estimation Engine

The ME engine comprises four main components: a shared SOD PE, an LV1 search unit, an LV2 search unit, and an LV3 search unit.

A. Shared Sum of Difference Processing Element (SOD PE)

In the BBME, each of the three pyramid layers uses its own search block size: 4×4, 8×8, and 16×16 for levels one to three, respectively. To maximize hardware utilization and minimize hardware cost, a single shared PE is designed to compute the SOD for all layers. As shown in Figure 22, the PE performs a 256-bit XOR operation followed by a 256-bit adder tree. The 256-bit XOR is partitioned into 16 blocks of 16-bit XOR operations, which provide sixteen 4×4 SOD results Si4×4 (i = 0~15). Then, the sixteen 4×4 SODs can be accumulated as four 8×8 SODs Si8×

Figure 22. Shared SOD PE design.
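The XOR and adder-tree structure of the shared PE can be illustrated with a minimal software sketch. It is not the hardware design itself: the function name `shared_sod`, the row-major tile layout, and the final 16×16 accumulation level are assumptions made for illustration, and binary blocks are modeled as 16×16 lists of 0/1 values.

```python
def shared_sod(cur, ref):
    """Return (sod4, sod8, sod16) for two 16x16 binary blocks (lists of 0/1)."""
    # XOR stage: 256 one-bit differences.
    diff = [[cur[y][x] ^ ref[y][x] for x in range(16)] for y in range(16)]
    # First adder level: sixteen 4x4 partial SODs, indexed [tile_row][tile_col]
    # (row-major tile layout is an assumption of this sketch).
    sod4 = [[sum(diff[4 * ty + dy][4 * tx + dx]
                 for dy in range(4) for dx in range(4))
             for tx in range(4)] for ty in range(4)]
    # Second level: each 8x8 SOD is the sum of a 2x2 group of 4x4 SODs.
    sod8 = [[sod4[2 * qy][2 * qx] + sod4[2 * qy][2 * qx + 1]
             + sod4[2 * qy + 1][2 * qx] + sod4[2 * qy + 1][2 * qx + 1]
             for qx in range(2)] for qy in range(2)]
    # Final level (assumed here): one 16x16 SOD from the four 8x8 SODs.
    sod16 = sum(sum(row) for row in sod8)
    return sod4, sod8, sod16


# Usage: three differing bits spread over three different 4x4 tiles.
cur = [[0] * 16 for _ in range(16)]
ref = [[0] * 16 for _ in range(16)]
ref[0][0] = ref[5][9] = ref[15][15] = 1
sod4, sod8, sod16 = shared_sod(cur, ref)
```

Because every larger SOD is built only from the smaller partial sums, one XOR array and one adder tree can serve all three block sizes, which is the sharing idea described above.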

B. Level 1 (LV1) Search Unit

The LV1 search unit, shown in Figure 23, is based on the proposed parallel architecture and completes the LV1 search with low power consumption. It contains an LV1 controller, an LV1 MV determiner (MV_LV1 determiner), and three banks of local memory. To complete the LV1 search, the controller directs data from the current search data buffer (LM_CUR_LV1) and the two reference search data buffers (LM_SW1_LV1 and LM_SW2_LV1) to the shared SOD PEs for SOD calculation. Each shared SOD PE can compute 16 LV1 SODs in parallel and returns the results to the MV_LV1 determiner for the final LV1 MV decision. For B-frame search, the two shared SOD PEs process the forward and backward search blocks in parallel. For P-frame search, the forward search block data is mirrored from LM_SW1_LV1 to LM_SW2_LV1, and the controller arranges the data flow so that one SOD PE processes half of the search locations while the other PE processes the other half. With this arrangement, both PEs are kept 100% busy for either frame type.

Figure 24 shows the data processing flow. For a search range of ±3, there are 7×7 search locations. To meet the design specification of the shared SOD PE, fourteen 4×4 SODs (two rows of seven search locations) are calculated in one cycle. In the first cycle, rows r0 and r1 in SW1 and rows r5 and r6 in SW2 are checked in parallel. Using this method, the ±3 search finishes in 4 cycles for B-frame search and 2 cycles for P-frame search. Two extra cycles are needed for control overhead. For a search range of ±7, the search takes 10 and 18 cycles for P-frame and B-frame search, respectively.
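The ±3 flow for a single search window can be modeled with a short sketch. It is a functional illustration only: `lv1_search`, the row-pair batching loop, and the first-minimum tie-break are assumptions, and the control overhead cycles are not modeled. Each loop iteration stands for one cycle in which up to two rows of seven locations (fourteen 4×4 SODs) are evaluated.

```python
R = 3                # LV1 search range (+/-3), as in the text
N = 2 * R + 1        # 7 search locations per row
ROWS_PER_CYCLE = 2   # 2 rows x 7 locations = fourteen SODs per cycle


def lv1_search(cur, sw):
    """cur: 4x4 binary block, sw: 10x10 binary search window (lists of 0/1).

    Returns (best_mv, best_sod, cycles) with best_mv = (mvx, mvy).
    """
    best_sod, best_mv, cycles = None, None, 0
    for row0 in range(0, N, ROWS_PER_CYCLE):
        cycles += 1  # one batch of up to fourteen locations per cycle
        for r in range(row0, min(row0 + ROWS_PER_CYCLE, N)):
            for c in range(N):
                sod = sum(cur[y][x] ^ sw[r + y][c + x]
                          for y in range(4) for x in range(4))
                # MV determiner: keep the first minimum (assumed tie-break).
                if best_sod is None or sod < best_sod:
                    best_sod, best_mv = sod, (c - R, r - R)
    return best_mv, best_sod, cycles


# Usage: a window that matches the current block only at MV (-1, 2).
sw = [[0] * 10 for _ in range(10)]
sw[6][4] = 1
cur = [[0] * 4 for _ in range(4)]
cur[1][2] = 1
mv, sod, cycles = lv1_search(cur, sw)
```

In this model the seven rows are covered in four batches, which agrees with the 4 search cycles stated for one direction of a B-frame search.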

Figure 23. Hardware architecture of LV1 search.

Figure 24. LV1 search order with ±3 search range. In this design, each SOD PE checks 14 search locations in parallel.

C. Level 2 (LV2) Search Unit

To fully use the shared SOD PEs and remove latency in this fine-tuning stage for low power, a hardware-efficient architecture is used in the LV2 design. The original algorithm poses two hardware design difficulties. The first is that two data accesses are needed to allow voting first and then a cross pattern search. Our solution is to check all the candidates sequentially, which avoids the branch operations. The second difficulty is the use of a multiplexer (MUX). A 2-D MUX, shown in Figure 26(a), is used to access the reference block from the SW buffer. For a search range of ±16, the SW size is 24×24 in LV2, so a 24×24 to 8×8 MUX is needed for the LV2 search. As the search range becomes wider, the hardware cost of the 2-D MUX increases significantly. Our solution is to partition the search binary bitplane into several regions stored in different register files, so that accessing one reference block needs only some of the regions. This fixes the MUX size independently of the search range. Figure 26(b) shows an example that partitions an LV2 search range into 9 regions, where only a 16×16 to 10×10 MUX is needed. However, dividing the search binary bitplane among several register files incurs some overhead for the extra address decoders. Table 7 compares the MUX area in work [10] with that of the proposed LV2 search. The proposed design achieves up to 34% saving in hardware area.

Figure 25 shows the hardware architecture of the LV2 search. In B-frame search, the LV2 binary MB from the PPU engine is stored in the on-chip memory of the LV2 search unit, LM_CUR_LV2, and the forward and backward SWs are stored in the on-chip memories LM_SW1_LV2 and LM_SW2_LV2, respectively. For each MB search, the LV2 controller (CTRL_LV2) controls the on-chip memory accesses and passes the data to the shared SOD PEs for pattern matching. In each cycle, one of the five candidates and its neighboring search locations are checked. The resulting SODs are sent back to the MV_LV2 determiner and compared with the minimum SOD stored there. If a smaller SOD is found, the minimum SOD in the determiner is updated. After all the candidate search locations are checked, the MVs with the minimum SOD for the two directions are output at the same time. For P-frame search, the five candidates are separated into two groups, and the candidates in the two groups are checked in parallel to reduce the cycle count.

In each cycle, the ±1 cross pattern of one of the five candidates has to be checked. However, the shared SOD PE can output only four 8×8 SODs per cycle. According to the experimental results, discarding the left position has the least impact on PSNR performance. Thus, the left search location in the ±1 cross pattern is discarded in our hardware implementation.

For the timing of the LV2 search, the proposed architecture takes one cycle to check each candidate. With two shared SOD PEs, the design takes 5 cycles to check the 5 candidates for the forward and backward search blocks in B-frame search, and 3 cycles for forward-only P-frame search. Including control overhead, the total cycle counts are 6 and 8 for P-frame and B-frame search, respectively.
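The candidate check and the cycle counts above can be summarized in a small sketch. It is a behavioral illustration under stated assumptions: `lv2_search` and `lv2_cycles` are hypothetical names, the LV2 search range is taken as ±8 so that the window is 24×24 as in the text, "up" is modeled as a negative vertical offset, and the overhead of 3 cycles is only inferred from the stated totals (5 + 3 = 8 and 3 + 3 = 6).

```python
from math import ceil

# Centre, right, up, down: the left neighbour of the +/-1 cross is dropped,
# so each candidate needs exactly four 8x8 SODs (one cycle of the shared PE).
CROSS_NO_LEFT = [(0, 0), (1, 0), (0, -1), (0, 1)]


def lv2_search(cur, sw, candidates, search_range=8):
    """cur: 8x8 binary block, sw: 24x24 binary window, candidates: list of MVs.

    Returns (best_mv, best_sod) over all candidates and their neighbours.
    """
    best_sod, best_mv = None, None
    for cx, cy in candidates:                 # one candidate per cycle
        for dx, dy in CROSS_NO_LEFT:
            mvx, mvy = cx + dx, cy + dy
            if max(abs(mvx), abs(mvy)) > search_range:
                continue                      # outside the search window
            x0, y0 = mvx + search_range, mvy + search_range
            sod = sum(cur[y][x] ^ sw[y0 + y][x0 + x]
                      for y in range(8) for x in range(8))
            if best_sod is None or sod < best_sod:
                best_sod, best_mv = sod, (mvx, mvy)
    return best_mv, best_sod


def lv2_cycles(b_frame, n_candidates=5, overhead=3):
    """Cycle count; overhead=3 is inferred from the stated totals."""
    # B-frame: each PE handles one direction; P-frame: PEs split candidates.
    search = n_candidates if b_frame else ceil(n_candidates / 2)
    return search + overhead


# Usage: the window matches the block only at MV (2, -1).
sw = [[0] * 24 for _ in range(24)]
sw[10][13] = 1
cur = [[0] * 8 for _ in range(8)]
cur[3][3] = 1
best = lv2_search(cur, sw, [(0, 0), (1, -1), (5, 5), (-3, 2), (2, -2)])
```

With the candidate (3, -1) alone, the matching position (2, -1) would be its left neighbour and is therefore never tested, which is the trade-off accepted by discarding the left location.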


Figure 25. Hardware architecture of LV2 search.


Figure 26. Working flow of LV2 design. (a) Methodology in work [10], which uses one 24×24 to 8×8 MUX. (b) Proposed LV2 design, which reduces the MUX size requirement to 16×16 to 8×8.

Table 7. Comparison of SW buffer area in work [10] and proposed LV2 search.

D. Level 3 (LV3) Search Unit

The LV3 design replaces the MUX with a shift register to save design cost and reduce power. Level 3 performs a ±2 full search, so the required SW size is 20×20. An architecture with a 2-D MUX would need a 20×20 to 16×16 MUX, which is a large hardware cost. Our solution is to adopt a shift register as the SW buffer. Figure 27 illustrates the working flow of the shift register. If the pixels in columns 2 and 3 are needed, the entire register array is circularly shifted by one column, so that the required data moves to the fixed fetching position, columns 1 and 2, for the next data processing step.

Figure 28 shows the architecture of the LV3 search. In B-frame search, the LV3 binary MB from the PPU engine is stored in the on-chip memory of the LV3 search unit, LM_CUR_LV3, and the forward and backward SWs are stored in the on-chip memories LM_SW1_LV3 and LM_SW2_LV3, respectively. For each MB search, the LV3 controller (CTRL_LV3) controls the on-chip memory accesses and passes the data to the shared SOD PEs for pattern matching. At cycle 0, the SOD at search position (-2, -2) is calculated. At cycles 1~4, the 20×20 search area pixels are circularly shifted to the left, column by column, as shown in Figure 29(a), so the SODs at search positions (-1, -2), (0, -2), (1, -2), and (2, -2) are obtained sequentially. At cycle 5, the 20×20 search area pixels are circularly shifted upward by one pixel, as shown in Figure 29(b), and the SOD at search position (2, -1) is calculated. In a similar way, all 25 search locations are covered in 25 cycles (cycles 0 to 24). The search order in SW2_LV3 is reversed, as shown in Figure 30. The SODs are sent back to the MV_LV3 determiner for comparison with the minimum SOD stored there. If a smaller SOD is found, the minimum SOD in the determiner is updated. After all the candidate search locations are checked, the MVs with the minimum SOD for the two directions are output at the same time. For P-frame search, the search locations are separated into two sub-groups that are checked in parallel.
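The shift-register flow for SW1_LV3 can be sketched as follows. This is a software model rather than the circuit: `lv3_search` is a hypothetical name, the register array is a 20×20 list of 0/1 values, and the rightward shift on odd rows is inferred from the snake order. The reference block is always read from the same fixed position (the top-left 16×16 registers), and the window is moved under it by circular shifts.

```python
def lv3_search(cur, sw):
    """cur: 16x16 binary block, sw: 20x20 binary window (lists of 0/1).

    Returns (best_mv, best_sod, order), where order lists the search
    positions visited at cycles 0..24.
    """
    regs = [row[:] for row in sw]   # shift-register copy of the window
    pos = [-2, -2]                  # search position seen at the fixed fetch
    best_sod, best_mv, order = None, None, []
    for cycle in range(25):
        if cycle > 0:
            if cycle % 5 == 0:
                regs = regs[1:] + regs[:1]               # circular shift up
                pos[1] += 1
            elif (cycle // 5) % 2 == 0:
                regs = [r[1:] + r[:1] for r in regs]     # circular shift left
                pos[0] += 1
            else:
                regs = [r[-1:] + r[:-1] for r in regs]   # circular shift right
                pos[0] -= 1
        # Fixed fetching position: always the top-left 16x16 registers.
        sod = sum(cur[y][x] ^ regs[y][x]
                  for y in range(16) for x in range(16))
        order.append((pos[0], pos[1]))
        if best_sod is None or sod < best_sod:           # MV determiner update
            best_sod, best_mv = sod, (pos[0], pos[1])
    return best_mv, best_sod, order


# Usage: the window matches the block only at MV (1, -1).
sw = [[0] * 20 for _ in range(20)]
sw[6][8] = 1
cur = [[0] * 16 for _ in range(16)]
cur[5][5] = 1
mv, sod, order = lv3_search(cur, sw)
```

Since the window is never shifted by more than four columns or rows, the wrapped-around data never enters the fixed 16×16 fetch area, so no 2-D MUX is needed to select the reference block.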

For the timing of the LV3 search, the proposed architecture takes one cycle to check each search location. With two shared SOD PEs, the design takes 25 cycles to check the 25 search locations for the forward and backward search blocks in B-frame search, and 13 cycles for forward-only P-frame search, where the 25 locations are split between the two PEs. Including the memory access latency for fetching SW_LV3 data and the control overhead, the total cycle counts are 18 and 34 for P-frame and B-frame search, respectively.

Figure 27. Fixed fetching position by adopting shifter register as SW buffer.

Figure 28. Hardware architecture of LV3 search.


Figure 29. (a) Whole SW1_LV3 is shifted in the left direction at cycles 1~4, 11~14, and 21~24. (b) Whole SW1_LV3 is shifted in the upward direction at cycles 5, 10, 15, and 20. (c) Whole SW1_LV3 is shifted in the right direction at cycles 6~9 and 16~19.


Figure 30. (a) LV3 ME search order for SW1_LV3 begins from the top-left point and follows a snake order. (b) LV3 ME search order for SW2_LV3 is the reverse order.

From the document: A Low-Power Motion Estimation Module Design for Bi-directional Prediction (pages 46-54)