DCT/IDCT ARCHITECTURE - ARCHITECTURE DESIGN OF MPEG-4 VIDEO TEXTURE CODING

CHAPTER 3 ARCHITECTURE DESIGN OF MPEG-4 VIDEO TEXTURE CODING

3.3 DCT/IDCT ARCHITECTURE

A popular approach to implementing the 2-D DCT/IDCT is the row-column decomposition method [8], [9]. The 2-D transform is computed by applying the 1-D DCT/IDCT first along the rows and then along the columns, which reduces complexity and hardware cost. A transpose memory is needed to store the intermediate data between the two 1-D DCT/IDCT passes and to reorder the coefficients from column-by-column to row-by-row. We implement the DCT/IDCT architecture based on Weiping Li's algorithm. The basic computation performed by the DCT/IDCT is the evaluation of products of an N x N matrix with N x 1 vectors. The 2-D transform is a triple matrix product, Z = AXA^T for the DCT and Z = A^TXA for the IDCT [12], where A is the 8 x 8 matrix shown in equation (3-1). Even rows of A are even-symmetric and odd rows of A are odd-symmetric, so the matrix can be separated into its even and odd rows by exploiting this row symmetry.
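The row-column computation of Z = AXA^T can be sketched in a few lines of Python. This is only an illustration under stated assumptions: because equation (3-1) is not reproduced here, `dct_matrix` builds the standard orthonormal 8 x 8 DCT-II matrix as a stand-in for A, and all helper names (`dct_matrix`, `matvec`, `transpose`, `transform_2d`, `dct_2d`, `idct_2d`) are hypothetical rather than taken from the design.

```python
import math

N = 8  # block size used throughout this section


def dct_matrix(n=N):
    """Standard orthonormal DCT-II matrix, assumed here as a stand-in for A."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def matvec(m, x):
    """Product of a matrix with a column vector (one 1-D transform)."""
    return [sum(c * v for c, v in zip(row, x)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def transform_2d(x, m):
    """Compute M X M^T with two 1-D passes and a transpose in between."""
    first = [matvec(m, row) for row in x]        # 1-D transform of every row
    turned = transpose(first)                    # role of the transpose memory
    second = [matvec(m, row) for row in turned]  # 1-D transform of every column
    return transpose(second)                     # restore the original orientation


def dct_2d(x):
    return transform_2d(x, dct_matrix())             # Z = A X A^T


def idct_2d(z):
    return transform_2d(z, transpose(dct_matrix()))  # X = A^T Z A
```

Each 1-D pass is a set of 8 x 8 matrix by 8 x 1 vector products, which is the basic computation the hardware unit has to perform.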


The row-column decomposition architecture of the 2-D DCT/IDCT is shown in Fig. 3-4. The 2-D DCT/IDCT is typically separated into two 1-D DCT/IDCT transforms to reduce area cost and complexity. The row-column decomposition technique can be implemented in two ways: DRCD (direct row-column decomposition) and MRCD (multiplexed row-column decomposition), shown in Fig. 3-4(a) and Fig. 3-4(b), respectively.

The DRCD architecture consists of two 1-D DCT/IDCT units and one transpose memory. Compared with the MRCD, it has a larger hardware cost but lower latency. The MRCD architecture requires only one 1-D DCT/IDCT unit and one transpose memory. It uses a multiplexer and a de-multiplexer to select the processing path. Because the row and column passes share the same 1-D DCT/IDCT unit, new data cannot be input while the unit is busy. This results in longer latency and a lower throughput rate, but the MRCD needs a smaller area than the DRCD.

Figure 3-4 2-D DCT/IDCT architecture of row-column decomposition: (a) DRCD, (b) MRCD

The 1-D DCT/IDCT architecture consists of several processing units, as shown in Fig. 3-5. Both the DCT and the IDCT use a five-stage pipeline architecture, which consists of the following blocks.

I. Serial to parallel

A serial-to-parallel unit is needed because the DCT/IDCT requires its 8 input pixels in parallel.

II. Pre-processor

The pre-processing unit of the DCT/IDCT produces a set of 8 new values according to (3-2). These new values are computed by additions and subtractions of combinations of the input pixels.

III. Multiplier-adder

The multiplier-adder unit accumulates the products in eq. (3-3) to eq. (3-5).


IV. Post-processor

The post-processing unit produces the 8 DCT/IDCT coefficients according to eq. (3-6).

V. Parallel to serial

Finally, a parallel-to-serial unit outputs the coefficients serially.
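How a pre-processor made of additions and subtractions lightens the multiplier stage can be illustrated with the generic even/odd split of the 1-D DCT. The sketch below is not the exact set of equations (3-2) to (3-6) of Weiping Li's algorithm; it assumes the standard orthonormal DCT-II matrix as A, and the function names are hypothetical.

```python
import math


def dct_matrix(n=8):
    """Standard orthonormal DCT-II matrix, assumed here as a stand-in for A."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def dct_1d_direct(x, a):
    """Reference: full matrix-vector product, 8 multiplications per output."""
    return [sum(c * v for c, v in zip(row, x)) for row in a]


def dct_1d_even_odd(x, a):
    """Same result using the row symmetry of A, 4 multiplications per output."""
    n = len(x)
    half = n // 2
    # Pre-processing: additions and subtractions of mirrored input pairs.
    sums = [x[j] + x[n - 1 - j] for j in range(half)]
    diffs = [x[j] - x[n - 1 - j] for j in range(half)]
    out = []
    for k in range(n):
        # Even rows are even-symmetric, so they only need the sums;
        # odd rows are odd-symmetric, so they only need the differences.
        operands = sums if k % 2 == 0 else diffs
        out.append(sum(a[k][j] * operands[j] for j in range(half)))
    return out
```

In this sketch an 8-point transform needs 32 multiplications instead of 64, because each output uses only the first four coefficients of its row.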

Figure 3-5 Architecture of 1D DCT/IDCT unit

Figure 3-6 Architecture of multiplier-adder unit (blocks: MUL 0, MUL 1, MUL 2, MUL 3, Adder 0, Adder 1, Fin_Adder; output: out)

hardware implementation. Because eq. (3-2) is modified, m(4) and m(0) in eq. (3-5) must be rewritten when the unit operates in DCT mode. The new equation is given in (3-9). The pre-processor requires 12 adders.

We can rearrange (3-7) to use fewer adders, as shown in (3-10). The post-processor requires 14 adders.

The complexity of our 1-D DCT/IDCT algorithm is summarized in Table 3-1. The multiplier-adder module needs four multipliers and three adders. The pre-processor and the post-processor require twelve and fourteen adders, respectively. One more adder is used for rounding after the parallel-to-serial block. The total number of adders in the proposed 1-D DCT/IDCT algorithm is therefore 30.

Table 3-1 Complexity of the proposed 1-D DCT/IDCT algorithm

              Our proposed
Multipliers   4
Adders        12 + 3 + 14 + 1 = 30

As shown in Table 3-2, our design uses fewer multipliers than previous work to implement the 1-D DCT/IDCT architecture. The design of [11] is also based on Weiping Li's algorithm and needs 7 multipliers to implement a 1-D DCT/IDCT. The design of [13] requires 9 multipliers and 21 adders.

Table 3-2 Complexity of different 1-D DCT/IDCT designs

Algorithm     Cheng's [13]   Bousselmi's [11]   Our proposed
Multipliers   9              7                  4
Adders        21             31                 30

Fig. 3-7 shows the read/write operation of the DCT/IDCT transpose memory [14]. For the row-column decomposition of the 2-D DCT/IDCT, the coefficients of the 1-D DCT/IDCT are written into the transpose memory row by row, in address order (0, 1, 2, 3, 4, 5, 6, 7, 8, …). Once the coefficients have been written up to address 49, the data in the first column can start to be read. After that, the data in the transpose memory are read column by column (0, 8, 16, 24, 32, 40, 48, 56, 1, …). As shown in Fig. 3-7(b), the data written to address 56 is ready to be read after 8 cycles. In the next time slice, the coefficients of the following block are written column by column and read row by row. With this structure, the transpose memory can therefore be read and written at the same time.
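The alternating addressing scheme can be illustrated with a small behavioural model. This is a simplified sketch, not a cycle-accurate description of the design: it assumes a single 64-word memory in which each address is read immediately before it is overwritten, and it ignores the start offset around address 49 mentioned above. The names `row_major`, `col_major`, and `run_transpose_memory` are hypothetical.

```python
N = 8  # 8 x 8 block, addresses 0..63


def row_major(t):
    """Address sequence 0, 1, 2, ..., 63."""
    return t


def col_major(t):
    """Address sequence 0, 8, 16, ..., 56, 1, 9, ..."""
    return (t % N) * N + t // N


def run_transpose_memory(blocks):
    """Stream blocks (flat lists of N*N values) through one shared memory.

    Even-numbered blocks are written row by row and odd-numbered blocks
    column by column. Each address is read (previous block) just before it
    is overwritten (current block), so reads and writes overlap in time.
    """
    memory = [None] * (N * N)
    outputs = []
    # One extra pass with no input flushes the last block out of the memory.
    for index, block in enumerate(list(blocks) + [None]):
        order = row_major if index % 2 == 0 else col_major
        read_back = []
        for t in range(N * N):
            addr = order(t)
            if index > 0:
                read_back.append(memory[addr])  # read the previous block
            if block is not None:
                memory[addr] = block[t]         # write the current block
        if index > 0:
            outputs.append(read_back)
    return outputs
```

Because the read order of one block is reused as the write order of the next, every block leaves the memory transposed without a second buffer.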

Figure 3-7 Read/write action of DCT/IDCT transpose memory

A multiplier-adder based DCT/IDCT architecture is shown in Fig. 3-8. It contains five quantization error sources: (1) the quantization of the coefficients for the row-wise and the column-wise transforms (Coeff1 and Coeff2); (2) the wordlength reduction at the outputs of the first and the second multipliers (Acc1 and Acc2); and (3) the output of the limiter for the row-wise transform (1D_Out). The most suitable way to decide the minimum wordlengths of Coeff1, Coeff2, and 1D_Out is to compute the overall mean square error. The peak mean error and the overall mean error are important in determining the minimum wordlengths of Acc1 and Acc2 [15]. The optimum wordlengths of these five terms in our work are shown in Table 3-3. The IDCT precision of this module meets the IEEE IDCT precision standard, and the implementation results are shown in Chapter 4.


Figure 3-8 Block diagram of a multiplier-adder based 2-D DCT/IDCT

Table 3-3 Optimized wordlengths for our 2-D DCT/IDCT architecture

Term      Optimized wordlength (bits)
Coeff1    13
Acc1      21
1D_Out    16
Coeff2    12
Acc2      20
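The effect of the coefficient wordlength on the mean square error can be explored with a rough experiment such as the one below. It is only a sketch under several assumptions that are not taken from this work: the standard orthonormal DCT-II matrix stands in for A, coefficients are stored as signed fixed-point numbers with `bits - 1` fractional bits, the test vectors are pseudo-random integers, and only the coefficient quantization of a single 1-D pass is modelled. It is not the IEEE 1180 test procedure, and the function names are hypothetical.

```python
import math
import random


def dct_matrix(n=8):
    """Standard orthonormal DCT-II matrix, assumed here as a stand-in for A."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def quantize(a, bits):
    """Round coefficients to signed fixed point with bits-1 fractional bits (assumed format)."""
    scale = 1 << (bits - 1)
    return [[round(c * scale) / scale for c in row] for row in a]


def coeff_mse(bits, trials=200, seed=0):
    """Mean square error of a 1-D transform caused by coefficient quantization only."""
    rng = random.Random(seed)
    exact = dct_matrix()
    fixed = quantize(exact, bits)
    total, count = 0.0, 0
    for _ in range(trials):
        x = [rng.randint(-256, 255) for _ in range(8)]
        for row, row_q in zip(exact, fixed):
            ref = sum(c * v for c, v in zip(row, x))
            approx = sum(c * v for c, v in zip(row_q, x))
            total += (ref - approx) ** 2
            count += 1
    return total / count
```

Sweeping `bits` and comparing the resulting error against a target is one simple way to look for a minimum wordlength; the values in Table 3-3 come from the authors' own analysis, not from this sketch.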

To reduce the hardware cost of the DCT/IDCT architecture, rounding is required after a multiplication. In our architecture, rounding is used to shorten the output data of each 1-D DCT/IDCT module. We adopt the true rounding method to improve the IDCT precision. For an n x n multiplication, true rounding adds a 1 at the nth least significant bit of the product and then truncates the least significant n bits of the sum. This process is illustrated in Fig. 3-9.


Figure 3-9 Rounded Multiplication Dot Diagram
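The true rounding rule described above can be written down in a few lines and compared with plain truncation. This is a minimal sketch, assuming integer operands and an arithmetic right shift (which is what Python's `>>` performs on integers); the function names are hypothetical.

```python
def true_round_product(a, b, n):
    """Multiply, add a 1 at the nth least significant bit, then drop n bits."""
    product = a * b
    return (product + (1 << (n - 1))) >> n


def truncate_product(a, b, n):
    """Multiply and simply drop the n least significant bits."""
    return (a * b) >> n
```

For example, with n = 4 the product 7 x 9 = 63 corresponds to 63/16 = 3.9375; true rounding yields 4, while truncation yields 3.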

The core characteristics of the DCT/IDCT architecture are summarized in Table 3-4.

Table 3-4 Core characteristics of the DCT/IDCT architecture

Inputs                9 bits (DCT), 12 bits (IDCT)
Outputs               12 bits (DCT), 9 bits (IDCT)
Internal wordlength   16 bits
Technology            0.18-um CMOS
No. of transistors    168,244
Clock rate            70 MHz
Mode selection        DCT or IDCT
Block size            8 x 8
Accuracy              IEEE Std. 1180-1990
