5.3 Combined Inter and Intra Prediction
In this section, we implement the inter and intra prediction with a reconfigurable systolic architecture. First, we map all kinds of computations into systolic operations so that the different predictions can be performed in a similar way. Second, we synthesize a unified systolic architecture from the description of the prediction algorithm. A systolic architecture consists of a number of regular, modular processing elements (PEs) that simultaneously process and pass data in a similar way. All PEs regularly pump data in and out such that a regular data flow is maintained [67]. Third, we use several multiplexers to configure the data paths according to the coding modes and the motion vectors. Therefore, we can share the array of PEs among all the prediction modes, including the inter prediction of the luminance component, the inter prediction of the chrominance component, and the intra prediction of both components. For the inter prediction, the 2-D interpolation is conducted through two separable 1-D filterings. For the intra prediction, the boundary pixels are reshuffled before being fed into the systolic array.
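As a minimal sketch of the separable 2-D interpolation, the Python fragment below filters each row and then each column with the same 1-D routine. It assumes the 6-tap H.264/AVC half-pel filter (1, -5, 20, 20, -5, 1) and 8-bit samples; the names `fir_1d`, `clip_pixel`, and `half_pel_center` are hypothetical and model only the arithmetic, not the PE array itself.

```python
# Assumed weights: the 6-tap H.264/AVC half-pel filter (sum of taps = 32).
TAPS = (1, -5, 20, 20, -5, 1)


def fir_1d(samples, taps=TAPS):
    """Unnormalized 1-D FIR: one output per full window of len(taps) samples."""
    n = len(taps)
    return [
        sum(t * s for t, s in zip(taps, samples[i:i + n]))
        for i in range(len(samples) - n + 1)
    ]


def clip_pixel(value):
    """Clamp to the 8-bit sample range (8-bit video is an assumption here)."""
    return max(0, min(255, value))


def half_pel_center(block):
    """Centre half-pel samples from two separable 1-D passes.

    block: rows of full-pixel samples, (H + 5) x (W + 5) for H x W outputs.
    """
    # Pass 1: filter every row horizontally, keeping full precision.
    rows = [fir_1d(list(row)) for row in block]
    # Pass 2: filter every column of the intermediate result vertically.
    cols = [fir_1d(list(col)) for col in zip(*rows)]
    # Normalize by 32 * 32 = 1024 with rounding, clip, and transpose back.
    return [[clip_pixel((v + 512) >> 10) for v in row] for row in zip(*cols)]
```

For a 9x9 window of full-pixel samples, `half_pel_center` returns the 4x4 block of centre half-pel samples. Both passes call the same 1-D routine, which is the property that allows a single filter array to serve both directions.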
In the following subsections, we first elaborate on the disadvantages of the related work on the design of inter and intra prediction. After that, we present a unified architecture that efficiently combines the inter and intra prediction. Then, we detail the operations of the architecture under different configurations. Lastly, we compare the proposed architecture with the state-of-the-art designs.
5.3.1 Motivation
The spatial and temporal predictions are essential to video coding efficiency. H.264/AVC [6] incorporates both inter and intra prediction to remove temporal and spatial redundancy. Compared with the earlier standards H.261/2/3 and MPEG-1/2/4, these prediction techniques save up to 50% of the bit rate while providing similar perceptual quality [68]. However, the coding gain comes at the cost of additional computations. In the intra prediction, the mode-adaptive predictor is generated by a 1-D filtering, which is conducted along the boundary pixels of a block. Similarly, the half-/quarter-pel predictor in the inter prediction is produced through a separable 2-D filtering of the motion-compensated blocks of variable size. Both predictions require intensive filtering operations that pose challenges for real-time applications. Moreover, the adaptive and irregular filtering makes hardware implementation more difficult. Therefore, there is much related work on the design of the inter and intra prediction.
However, the state-of-the-art designs of the pixel prediction, which includes the inter and intra prediction, have the following disadvantages in terms of efficiency and utilization:
1. The inter and intra predictions are always implemented as two separate modules due to the difference in their operations [69][70][71][72][73][74][75][76]. However, in decoder and transcoder applications, the prediction mode of each macroblock in the input bitstream is known in advance, so a macroblock exercises only one of the two modules at a time. Thus, using separate hardware resources for the inter and intra prediction causes poor hardware utilization.
2. For the interpolation in the inter and intra prediction, most prior works implement the finite impulse response (FIR) filter with the traditional adder-tree (AT) structure [77][44][78][47][79], where the filtering is implemented by a number of tree-structured adders and shifters. However, in such a straightforward implementation, common terms between consecutive filtering operations are not reused at all. Moreover, multiple input samples are latched simultaneously for each filtered output, which requires higher input bandwidth.
3. The number of FIR filters in the AT-based design [34][78][47] is chosen for the worst case, i.e., all the 4x4 blocks are coded in Inter_4x4 mode and require 2-D interpolation. However, the filter utilization decreases significantly for block partitions that only require 1-D interpolation. Furthermore, a design for the 4x4 block partition introduces redundant computations for larger block partitions. In our simulation, the worst case rarely occurs in actual bitstreams, so the AT-based design delivers worse system performance on average.
4. In addition to the less efficient FIR filter design, the AT-based design is tightly coupled with the external memory [47] or the on-chip data bus [71]. The latency of the external memory could compromise the performance of the prediction module. Besides, the redundant data transmission not only increases the transmission power but also degrades the system performance through serious bus contention.
To increase the efficiency and the utilization, we propose a unified filtering architecture for the inter and intra prediction. First, we share the data paths between the inter and intra prediction so as to increase hardware utilization and reduce hardware cost. Second, to minimize redundant computations in the pixel predictions, the FIR filtering is implemented by a reconfigurable systolic architecture. Third, our proposed systolic architecture is fully utilized for any kind of interpolation and block partition. Fourth, we allocate a local FIFO and memory for temporarily buffering the motion-compensated data and the intermediate data such that the motion-compensated data of a block partition is transferred without redundant transmission.
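The contrast between an adder-tree filter and a systolic filter can be illustrated in software. The sketch below models one textbook systolic FIR arrangement, in which the weights stay in the PEs while samples and partial sums move from PE to PE at different speeds. This arrangement, the 6-tap weights, and the names `direct_fir` and `systolic_fir` are assumptions made for illustration; the PE design proposed in this work may differ.

```python
# Assumed weights: the 6-tap H.264/AVC half-pel filter.
TAPS = (1, -5, 20, 20, -5, 1)


def direct_fir(samples, taps=TAPS):
    """Adder-tree style reference: every output reads len(taps) samples at once."""
    n = len(taps)
    return [
        sum(taps[k] * samples[i + n - 1 - k] for k in range(n))
        for i in range(len(samples) - n + 1)
    ]


def systolic_fir(samples, taps=TAPS):
    """Cycle-by-cycle model of a weight-stationary systolic FIR.

    One new sample enters per cycle. PE k sees the sample stream delayed by
    2*k cycles and adds taps[k] * sample to the partial sum that its left
    neighbour produced in the previous cycle.
    """
    n = len(taps)
    x_regs = [0] * (2 * (n - 1) + 1)  # sample delay line; PE k reads index 2*k
    y_regs = [0] * n                  # one partial-sum register per PE
    outputs = []
    stream = list(samples) + [0] * (n - 1)  # extra zeros flush the pipeline
    for cycle, sample in enumerate(stream):
        x_regs = [sample] + x_regs[:-1]
        # Synchronous update: each PE uses its neighbour's previous value.
        y_regs = [
            (y_regs[k - 1] if k else 0) + taps[k] * x_regs[2 * k]
            for k in range(n)
        ]
        if cycle >= 2 * (n - 1):  # first cycle with a complete window
            outputs.append(y_regs[-1])
    return outputs
```

Both functions return the same values, but `systolic_fir` consumes a single new sample per cycle, whereas `direct_fir` needs all six samples of a window for each output. That difference is the input-bandwidth point raised in item 2 above.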
5.3.2 Overall Architecture
The overall architecture of our inter and intra prediction is given in Figure 5.10. Depending on the coding mode, the controller governs the data streams entering and leaving the ports of the unified systolic array. In H.264/AVC, the predictor of a block is created from image samples that are coded in either the previously decoded frames or the current frame. The inter prediction creates the predictor of a block from the previously coded frames that are stored in the external memory. As mentioned in Section 4.3, the motion-compensated data, which is determined by the motion vector and the size of the block partition, is pre-fetched into the synchronization buffer before the predictor of the current block is generated. Thus, the source data of the inter prediction comes from the on-chip data bus via the AHB interface. Because of the mismatch between the 32-bit data bus and the pixel-wise processing granularity of the systolic array, internal pixel FIFOs are used to harmonize the bus transmission and the pixel interpolation. On the other hand, the intra prediction creates the predictor for a block using the boundary pixels in the adjacent blocks. That means the systolic array gets its data from the local memory that stores the boundary pixels of the adjacent blocks when intra prediction is performed. Therefore, the finite state machine (FSM) controller can select the source data for the input of the systolic array through two 5-to-1 multiplexers, as shown in Figure 5.10.

Figure 5.10: The Architecture of Inter and Intra Prediction
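The role of the pixel FIFOs can be sketched as follows: each 32-bit bus word carries four 8-bit pixels, which are queued and then handed to the array one pixel at a time. The byte order within a word (least significant byte first) and the names `unpack_word` and `PixelFifo` are assumptions for illustration and are not taken from the design.

```python
from collections import deque


def unpack_word(word):
    """Split a 32-bit word into four 8-bit pixels, lowest byte first (assumed)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


class PixelFifo:
    """Hypothetical model of a FIFO between a word-wide bus and a pixel-wide array."""

    def __init__(self):
        self._queue = deque()

    def push_word(self, word):
        """Bus side: accept one 32-bit word, i.e. four pixels per transfer."""
        self._queue.extend(unpack_word(word))

    def pop_pixel(self):
        """Array side: deliver one pixel per call, in arrival order."""
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)
```

With this model, one bus transfer keeps the pixel-wise consumer supplied for four pops, which is the rate matching that the FIFOs provide between the bus and the systolic array.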
In addition to selecting the input of the systolic array, the FSM controller controls the data flow for each inter-coded block, which is of variable size and requires various interpolations depending on the motion vector. Due to the sub-pixel resolutions of motion vectors, such as 1/2-, 1/4-, and 1/8-pixel, the inter prediction requires intensive computations for the interpolation of motion-compensated full-pixel samples. Specifically, the 1/2-pixel samples are interpolated from full-pixel samples using the 6-tap FIR filter whose taps are (1, -5, 20, 20, -5, 1). Particularly, when the motion vector of a block points to a certain position, a 2-D interpolation may be