Inverse Transform Coder - Gate Count - High-Memory-Efficiency Embedded Video Decoder for the Ultra-High-Efficiency H.265 Standard


4.2 Inverse Transform Coder

A high-data-rate two-dimensional inverse transform coder is proposed to increase the pixel throughput of the HEVC system. The hardware schedules data at up to 16 pixels/cycle using parallel processing engines in each 4x4 or 8x8 inverse discrete cosine transform coder. The sections below describe the transform core in detail.

4.2.1 Hardware Sharing

The proposed inverse discrete cosine and sine transform hardware follows the HEVC algorithm. As shown in Figure 4.4 and Figure 4.5, the processing engine core for the discrete cosine transform contains 8 multipliers and adders/subtractors, while the core for the discrete sine transform contains 9 multipliers and 7 adders/subtractors. Rounding is implemented simply by wire shifting. Moreover, each processing engine can process 4 residuals in two cycles.

Figure 4.4: 4x4 inverse discrete cosine transform coder

Therefore, the 4 processing engines can process a 16x16 coefficient block in 4x4 mode, as delivered by the entropy decoder, in 16 cycles.
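To make the arithmetic behind a 4x4 processing engine concrete, the following is a minimal software sketch of the HEVC 4x4 inverse DCT using the standard integer coefficients (64, 83, 36) and an even/odd butterfly. It is an illustration, not the datapath of Figure 4.4: the function names are hypothetical, the multiplier count differs from the proposed hardware, and the 16-bit clipping of the intermediate values is omitted. The add-then-shift rounding corresponds to the wire-shifting rounding mentioned above.

```python
# HEVC 4x4 core transform matrix (each row is one basis function).
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

def inv_dct4_1d(c, shift):
    """4-point inverse DCT with an even/odd butterfly (hypothetical helper)."""
    rnd = 1 << (shift - 1)          # rounding offset, added before the shift
    e0 = 64 * (c[0] + c[2])         # even part
    e1 = 64 * (c[0] - c[2])
    o0 = 83 * c[1] + 36 * c[3]      # odd part
    o1 = 36 * c[1] - 83 * c[3]
    return [(e0 + o0 + rnd) >> shift,
            (e1 + o1 + rnd) >> shift,
            (e1 - o1 + rnd) >> shift,
            (e0 - o0 + rnd) >> shift]

def inv_dct4_1d_matrix(c, shift):
    """Reference version: direct matrix multiply, used only to check the butterfly."""
    rnd = 1 << (shift - 1)
    return [(sum(DCT4[k][n] * c[k] for k in range(4)) + rnd) >> shift
            for n in range(4)]

def inv_dct4x4(coeff, bit_depth=8):
    """2-D inverse transform: columns first (shift 7), then rows (shift 20 - bit_depth)."""
    cols = [inv_dct4_1d([coeff[r][c] for r in range(4)], 7) for c in range(4)]
    tmp = [[cols[c][r] for c in range(4)] for r in range(4)]
    return [inv_dct4_1d(row, 20 - bit_depth) for row in tmp]

# A block with only a DC coefficient yields a flat residual block.
dc_only = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
residual = inv_dct4x4(dc_only)
```

Each call to `inv_dct4_1d` produces 4 outputs, which mirrors the 4 residuals handled by one processing engine.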

In Figure 4.6, the processing engine of the 8x8 transform coder can share the 4x4 processing engine to reduce the hardware cost with about

Finally, the 4x4 and 8x8 inverse discrete cosine and sine transform coders are integrated into the I-frame decoder. With 90nm CMOS technology, the system runs at 320MHz and achieves a pixel throughput of 5G pixels/sec with the gates. Table 4.2 compares the throughput of the proposed transform coder with the designs in [18], [19], [20], [21] and [22]. With the frequency normalized, the proposed transform coder achieves a throughput about 2x to 15x higher.

Figure 4.5: 4x4 inverse discrete sine transform coder

Figure 4.6: 8x8 inverse discrete sine transform coder

Table 4.2: Pixel throughput in different designs

Design            Throughput (pixels/sec)   Frequency
CSVT '06 [18]     800M                      100MHz
TVLSI '08 [19]    100M                      100MHz
ISCAS '09 [20]    149M                      149MHz
ISOCC '10 [21]    800M                      100MHz
TVLSI '12 [22]    167M                      167MHz
Proposed          5G                        320MHz
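The frequency-normalized comparison can be reproduced from the figures in Table 4.2 by dividing each throughput by its clock frequency, which gives pixels per cycle. The short sketch below does this; the dictionary layout and function name are illustrative only, and the input values are the rounded numbers from the table.

```python
# (throughput in pixels/sec, frequency in Hz), taken from Table 4.2.
designs = {
    "CSVT '06 [18]":  (800_000_000, 100_000_000),
    "TVLSI '08 [19]": (100_000_000, 100_000_000),
    "ISCAS '09 [20]": (149_000_000, 149_000_000),
    "ISOCC '10 [21]": (800_000_000, 100_000_000),
    "TVLSI '12 [22]": (167_000_000, 167_000_000),
    "Proposed":       (5_000_000_000, 320_000_000),
}

def pixels_per_cycle(throughput, frequency):
    """Frequency-normalized throughput."""
    return throughput / frequency

proposed = pixels_per_cycle(*designs["Proposed"])

# Speed-up of the proposed design over each reference design.
speedups = {
    name: proposed / pixels_per_cycle(*values)
    for name, values in designs.items()
    if name != "Proposed"
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

With these inputs the ratios fall between roughly 1.95 and 15.6, which is where the 2x to 15x range comes from.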

4.2.2 Time Scheduling for WPP

With the 4-way parallel decoder, the high data throughput of the transform coder can output enough residual data within 1 pipeline stage for the 4 intra predictors and in-loop filters, as shown in Figure 4.7. Therefore, the hardware cost can be reduced from the original 4 transform coders to 1, as shown in Figure 4.8.
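The sharing argument can be checked with a simple throughput budget. The sketch below assumes that each of the 4 decoders consumes 4 residuals per cycle, matching the 4-pixel parallelism of the intra predictor described in Section 4.3.2; that per-decoder rate is an assumption made here for illustration, and the function name is hypothetical.

```python
import math

def transform_coders_needed(num_decoders, pixels_per_cycle_each, transform_pixels_per_cycle):
    """Number of transform coders required to keep every decoder fed each cycle."""
    demand = num_decoders * pixels_per_cycle_each
    return math.ceil(demand / transform_pixels_per_cycle)

# One shared coder at 16 pixels/cycle versus one 4 pixels/cycle coder per decoder.
shared = transform_coders_needed(4, 4, 16)
unshared = transform_coders_needed(4, 4, 4)
```

Under this assumption a single 16 pixels/cycle transform coder covers the combined demand of all 4 decoders.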

Figure 4.7: Transform scheduling in WPP (pipeline stages over time: INPUT BUF, Transform, INTRA, DF, ALF, OUTPUT; Decoders #1 to #4 with buffer read and buffer write)

Figure 4.8: System design of the WPP architecture (one inverse transform supplying transform coefficients to Decoders #1 to #4, each with intra, de-blocking and ALF)

4.3 Intra Prediction

This section describes the architecture of the intra predictor for the HEVC standard. Figure 4.9 shows the intra predictor architecture with a 1-line buffer for Ultra-HD resolution. The design is partitioned into 3 parts: reference sample selection, intra filter processing, and write to buffer.

Figure 4.9: Architecture design of the intra prediction (blocks: reference sample selection, upper reference buffer of frame width, left reference buffer, memory controller, angular intra mode, DC mode, planar mode, intra controller, adder, output controller, reconstruct; inputs: mode/size, IDCT info, DF info)

4.3.1 Reference Sample Selection

In the intra prediction, because the external memory bandwidth is limited, the upper reference buffer temporarily stores the unfiltered pixels after reconstruction with the residual data. As shown in Figure 4.10, the unfiltered pixels in the upper reference buffer, which are fetched per 16x16 block, are used for the top four 4x4 blocks, and the left reference buffer stores the rightmost pixels of the previous block. When the mode and size information arrives, the reference sample selection chooses which pixels in the upper and left reference buffers are used, based on the intra mode and intra size. As shown in Figure 4.11, if the mode belongs to the upper directions, the samples in range A are chosen; the samples in range C are chosen for the left mode directions.

Figure 4.10: Reference sample read of the intra prediction (32-bit reads from the upper reference buffer and the left reference buffer)

Figure 4.11: Reference sample range of the intra prediction (ranges A, B and C)

4.3.2 Angular Intra Mode

For angular intra prediction, the angle is defined as the displacement between the current pixel and the top reference pixel in vertical prediction, and as the displacement between the current pixel and the left-column pixel in horizontal prediction. Using a linear interpolation filter, the predicted pixels are generated from the appropriate reference samples. The hardware challenges are the long operation cycles and the design of a reconfigurable intra predictor that handles all angular intra modes. The proposed reconfigurable hardware design is shown in Figure 4.12. In the first part, the intra mode is converted into an angle, which is then used to calculate the position (POSn, n=0~3) between two input reference samples. For positive and negative angles, the reference samples are divided into a main array and a side array. When a vertical mode is chosen, a positive angle uses the top reference samples as the main array for prediction. Similarly, a horizontal mode with a positive angle uses the left reference samples as the main array. For a negative angle, the first step is instead to project the side array onto the main array. In addition, a lookup table and an add operation select the samples needed for the prediction. To raise the prediction throughput to meet the 8Kx4K resolution requirement, we adopt 4-pixel parallelism to produce 4 predicted pixels each cycle. The filter engine, which consists of 3 adders/subtractors and 1 multiplier, interpolates between 2 reference samples with 1/32-sample accuracy. The position (POSn, n=0~3) between the two input reference samples is the displacement calculated in the addition and shift step. Finally, the clip step is implemented only by wire shifting.

Figure 4.12: Filter engine of the intra prediction
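The interpolation step can be sketched in software as follows. This is a minimal illustration of HEVC angular prediction for a vertical mode with a non-negative angle, producing one row of 4 predicted pixels per call. The function name, the argument layout (`ref[0]` as the top-left corner sample followed by the top and top-right samples) and the sample values are assumptions for this example. The weighted sum is written as `32*a + frac*(b - a) + 16`, which needs one subtraction, one multiplication and two additions, in line with the 3 adders/subtractors and 1 multiplier of the filter engine; whether the hardware uses exactly this form is not stated in the text.

```python
def predict_row_vertical(ref, y, angle, width=4):
    """Predict row y of a block for a vertical angular mode with angle >= 0.

    ref   : main reference array; ref[0] is the top-left corner sample,
            ref[1:] are the top and top-right samples (layout assumed here).
    angle : displacement per row in 1/32-sample units.
    """
    pos = (y + 1) * angle
    idx = pos >> 5          # integer part of the displacement
    frac = pos & 31         # fractional part, in 1/32 units
    row = []
    for x in range(width):
        a = ref[x + idx + 1]
        # When frac is 0 the second sample is not needed (and may not exist).
        b = ref[x + idx + 2] if frac else a
        # Same value as ((32 - frac) * a + frac * b + 16) >> 5, with one multiplier.
        row.append((32 * a + frac * (b - a) + 16) >> 5)
    return row

# Hypothetical reference samples for a 4x4 block (2*4 + 1 entries).
ref = [0, 10, 20, 30, 40, 50, 60, 70, 80]
pure_vertical = predict_row_vertical(ref, y=2, angle=0)
diagonal = predict_row_vertical(ref, y=3, angle=32)
fractional = predict_row_vertical(ref, y=0, angle=26)
```

With an angle of 0 each row simply copies the top samples, while a fractional displacement blends two neighbouring samples.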

4.3.3 Planar Mode

When the block is coded in planar mode, Figure 4.13(a) shows that the right-column values are produced by subtracting the left-column pixels from the top-right pixel, and the bottom-row values are produced by subtracting the top-row pixels from the bottom-left pixel. As shown in Figure 4.13(b), the linear interpolation is implemented by 3 subtractors operating on the bottom row and the shifted top-row pixels, and likewise on the left column and the top-right pixel. This approach in the intra prediction makes the pixel values continuous at the block edges, so the de-blocking filter rarely needs to be applied to smooth blocking artifacts. The planar mode improves smooth surfaces, where the transform residuals added to the predicted pixels would otherwise generate apparent blocking effects.

Figure 4.13: Planar Mode in Intra Prediction. (a) Generate the bottom row and right column values. (b) Architecture design of the planar mode.
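The difference-based formulation of planar prediction can be sketched as below. The incremental version first forms the bottom-row and right-column differences and then accumulates them, and it is checked against the direct HEVC planar formula. The function names and sample values are hypothetical, and the code illustrates the arithmetic only, not the exact structure of Figure 4.13(b).

```python
def planar_incremental(top, left, top_right, bot_left):
    """Planar prediction using precomputed differences and running sums."""
    n = len(top)
    shift = n.bit_length() - 1                       # log2(n), n is a power of two
    bottom_row = [bot_left - t for t in top]         # bottom-left minus top row
    right_col = [top_right - l for l in left]        # top-right minus left column
    top_acc = [t << shift for t in top]              # shifted top row, updated per row
    pred = []
    for y in range(n):
        hor = (left[y] << shift) + n                 # shifted left sample plus rounding
        row = []
        for x in range(n):
            hor += right_col[y]
            top_acc[x] += bottom_row[x]
            row.append((hor + top_acc[x]) >> (shift + 1))
        pred.append(row)
    return pred

def planar_direct(top, left, top_right, bot_left):
    """Direct form of the HEVC planar formula, used here as a reference."""
    n = len(top)
    shift = n.bit_length() - 1
    return [[((n - 1 - x) * left[y] + (x + 1) * top_right
              + (n - 1 - y) * top[x] + (y + 1) * bot_left + n) >> (shift + 1)
             for x in range(n)]
            for y in range(n)]

# Hypothetical reference samples for a 4x4 block.
top, left = [100, 102, 104, 106], [98, 96, 94, 92]
pred = planar_incremental(top, left, top_right=110, bot_left=90)
```

The incremental form replaces the per-pixel multiplications of the direct formula with additions of the precomputed differences.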

4.3.4 Write to Buffer

When the 16x16 block is coded as sixteen 4x4 intra blocks, the intra coding follows the double-z scan order. As shown in Figure 4.14, after block 0 has been predicted and reconstructed with the residual data, its rightmost pixels are used for block 1 to its right, and its bottommost pixels are stored for block 2 below it. When block 1 has finished, its rightmost pixels are stored in the left reference buffer for block 4 to its right, and its bottommost pixels are written to the upper reference buffer for block 3. When block 2 begins, the left reference buffer holds the reconstructed pixels of the previous block as usual. Because of the raster order, the reconstructed pixels on the block boundary must be saved for the incoming blocks to use. As a result, blocks 10, 11, 14 and 15 save their bottommost pixels to the internal SRAM.

Figure 4.14: Write back operation of the intra prediction
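The block numbering used above can be reproduced by de-interleaving the bits of the double-z scan index. The sketch below maps each of the sixteen 4x4 blocks to its position inside the 16x16 block, finds the right and lower neighbours that consume the written-back pixels, and lists the blocks on the bottom boundary. All function names are hypothetical.

```python
def zscan_to_xy(idx):
    """Map a double-z scan index (0..15) to (x, y) in units of 4x4 blocks."""
    x = (idx & 1) | ((idx >> 1) & 2)         # bits 0 and 2 form x
    y = ((idx >> 1) & 1) | ((idx >> 2) & 2)  # bits 1 and 3 form y
    return x, y

# Inverse lookup from position back to scan index.
XY_TO_ZSCAN = {zscan_to_xy(i): i for i in range(16)}

def neighbours(idx):
    """Return (right, below) block indices, or None at the 16x16 block boundary."""
    x, y = zscan_to_xy(idx)
    return XY_TO_ZSCAN.get((x + 1, y)), XY_TO_ZSCAN.get((x, y + 1))

# Blocks whose bottommost pixels must be kept for the next row of 16x16 blocks.
bottom_boundary = [i for i in range(16) if zscan_to_xy(i)[1] == 3]

for i in range(16):
    print(i, zscan_to_xy(i), neighbours(i))
```

The lookup confirms the neighbours named in the text: block 0 feeds blocks 1 and 2, block 1 feeds blocks 4 and 3, and the bottom boundary consists of blocks 10, 11, 14 and 15.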
