CHAPTER 3 ARCHITECTURE DESIGN OF MPEG-4 VIDEO TEXTURE CODING
3.3 DCT/IDCT ARCHITECTURE
A popular approach for implementing the 2-D DCT/IDCT is the row-column decomposition method [8], [9]. The 2-D transform is computed by applying the 1-D DCT/IDCT first along the rows and then along the columns, which reduces the complexity and the hardware cost. A transpose memory is needed to buffer the data between the two 1-D DCT/IDCT passes and to reorder the coefficients from column-by-column to row-by-row. We implement the DCT/IDCT architecture based on Weiping Li's algorithm. The basic computation performed by the DCT/IDCT is the product of an N x N matrix and N x 1 vectors. The 2-D transform is the triple matrix product Z = AXA^T for the DCT and Z = A^TXA for the IDCT [12], where A is the 8x8 matrix shown in equation (3-1). Even rows of A are even-symmetric and odd rows of A are odd-symmetric. By exploiting this symmetry, we can separate A into its even and odd rows.
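As an illustration of the row-column method, the following Python sketch builds an 8x8 transform matrix, checks the row symmetry described above, and confirms that two 1-D passes with a transpose in between give the same result as the triple product Z = AXA^T. It assumes that A is the standard orthonormal 8-point DCT-II matrix, which may differ in scaling or notation from equation (3-1). All function names are illustrative and are not taken from the hardware design.

```python
import math

N = 8  # block size

def dct_matrix():
    """Assumed form of A: A[k][n] = c(k)/2 * cos((2n+1)*k*pi/16), c(0) = 1/sqrt(2)."""
    return [[(math.sqrt(0.5) if k == 0 else 1.0) * 0.5
             * math.cos((2 * n + 1) * k * math.pi / 16)
             for n in range(N)] for k in range(N)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def dct_1d(a, vec):
    """1-D DCT of one 8-sample vector: an 8x8 matrix times an 8x1 vector."""
    return [sum(a[k][n] * vec[n] for n in range(N)) for k in range(N)]

def dct_2d_row_column(a, x):
    """2-D DCT as a row pass, a transpose, and a second identical pass."""
    first = [dct_1d(a, row) for row in x]                  # equals X * A^T
    second = [dct_1d(a, col) for col in transpose(first)]  # equals (A * X * A^T)^T
    return transpose(second)

def rows_are_symmetric(a, tol=1e-12):
    """Even rows: A[k][n] == A[k][7-n]. Odd rows: A[k][n] == -A[k][7-n]."""
    return all(abs(a[k][n] - (-1) ** k * a[k][N - 1 - n]) < tol
               for k in range(N) for n in range(N))

A = dct_matrix()
X = [[(3 * i + 5 * j) % 17 - 8 for j in range(N)] for i in range(N)]  # arbitrary test block
Z_direct = matmul(matmul(A, X), transpose(A))   # Z = A X A^T
Z_rowcol = dct_2d_row_column(A, X)
max_diff = max(abs(Z_direct[i][j] - Z_rowcol[i][j])
               for i in range(N) for j in range(N))
```

Under these assumptions the two results agree to within floating-point error, and the same 1-D routine serves both passes, which is the property the row-column architectures rely on.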
The row-column decomposition architecture of the 2-D DCT/IDCT is shown in Fig. 3-4. The 2-D transform is typically separated into two 1-D DCT/IDCT transforms to reduce the area cost and the complexity. Row-column decomposition can be implemented in two ways: direct row-column decomposition (DRCD) and multiplexed row-column decomposition (MRCD), shown in Fig. 3-4(a) and Fig. 3-4(b), respectively.
The DRCD architecture consists of two 1-D DCT/IDCT units and one transpose memory. Compared with the MRCD, it has a higher hardware cost but lower latency. The MRCD architecture requires only one 1-D DCT/IDCT unit and one transpose memory, and uses a multiplexer and a demultiplexer to select the processing path. Because the row and column passes share the same 1-D DCT/IDCT unit, new input data cannot be accepted while the unit is busy, which results in longer latency and a lower throughput rate. In exchange, the MRCD occupies a smaller area than the DRCD.
Figure 3-4 2-D DCT/IDCT architecture of row-column decomposition
The 1-D DCT/IDCT unit consists of several processing blocks, as shown in Fig. 3-5.
Both the DCT and the IDCT use a five-stage pipeline made up of the following blocks.
I. Serial to parallel
A serial-to-parallel unit is needed because the DCT/IDCT requires its 8 input pixels in parallel.
II. Pre-processor
The pre-processing unit of the DCT/IDCT produces a set of 8 new values according to (3-2). These values are computed by adding and subtracting combinations of the input pixels.
III. Multiplier-adder
The multiplier-adder unit evaluates the sums of products in eqs. (3-3) to (3-5).
IV. Post-processor
The post-processing unit produces the 8 DCT/IDCT coefficients according to eq. (3-6).
V. Parallel to serial
Finally, a parallel-to-serial unit outputs the coefficients serially.
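The role of the pre-processor and the multiplier-adder stages can be sketched with the generic even-odd decomposition that follows from the row symmetry of A. This is not the exact form of eqs. (3-2) to (3-6), which belong to Weiping Li's algorithm; it is a minimal Python sketch assuming the standard orthonormal 8-point DCT-II matrix, with hypothetical function names. The serial-to-parallel and parallel-to-serial stages are represented only by the list input and output.

```python
import math

N = 8

def dct_matrix():
    """Assumed transform matrix: A[k][n] = c(k)/2 * cos((2n+1)*k*pi/16)."""
    return [[(math.sqrt(0.5) if k == 0 else 1.0) * 0.5
             * math.cos((2 * n + 1) * k * math.pi / 16)
             for n in range(N)] for k in range(N)]

def pre_process(x):
    """Butterfly stage: sums feed the even outputs, differences feed the odd outputs."""
    half = N // 2
    sums = [x[n] + x[N - 1 - n] for n in range(half)]
    diffs = [x[n] - x[N - 1 - n] for n in range(half)]
    return sums, diffs

def multiply_add(a, sums, diffs):
    """Each output coefficient is a sum of four products."""
    out = []
    for k in range(N):
        src = sums if k % 2 == 0 else diffs
        out.append(sum(a[k][n] * src[n] for n in range(N // 2)))
    return out

def dct_1d_even_odd(a, x):
    """1-D DCT using the pre-processor followed by the multiplier-adder stage."""
    sums, diffs = pre_process(x)
    return multiply_add(a, sums, diffs)

def dct_1d_direct(a, x):
    """Reference: full 8x8 matrix times 8x1 vector, eight products per output."""
    return [sum(a[k][n] * x[n] for n in range(N)) for k in range(N)]

A = dct_matrix()
x = [12, -3, 7, 0, 25, -14, 9, 4]  # arbitrary input samples
y_fast = dct_1d_even_odd(A, x)
y_ref = dct_1d_direct(A, x)
```

In this sketch each output needs four multiplications instead of eight, so an 8-point transform uses 32 products rather than 64. A sum of four products per output is also consistent with a multiplier-adder module built from four multipliers.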
Figure 3-5 Architecture of 1D DCT/IDCT unit
Figure 3-6 Architecture of multiplier-adder unit (four multipliers MUL 0 to MUL 3, adders Adder 0 and Adder 1, and a final adder Fin_Adder)
hardware implementation. Because eq. (3-2) is modified, m(4) and m(0) in eq. (3-5) must be rewritten when the unit operates in DCT mode. The new equation is given in (3-9). The pre-processor requires 12 adders.
We can rearrange (3-7) as shown in (3-10) to reduce the number of adders. The post-processor then requires 14 adders.
The complexity of our 1-D DCT/IDCT algorithm is summarized in Table 3-1. The multiplier-adder module needs four multipliers and three adders. The pre-processor and the post-processor require twelve and fourteen adders, respectively. One more adder performs rounding after the parallel-to-serial block. The total number of adders in the proposed 1-D DCT/IDCT algorithm is therefore 30.
Table 3-1 Complexity of the proposed 1-D DCT/IDCT algorithm
              Our proposed
Multipliers   4
Adders        12+3+14+1=30
As shown in Table 3-2, our design uses fewer multipliers than previous work to implement the 1-D DCT/IDCT. The design of [11] is also based on Weiping Li's algorithm and needs 7 multipliers for a 1-D DCT/IDCT. The design of [13] requires 9 multipliers and 21 adders.
Table 3-2 Complexity of different 1-D DCT/IDCT designs
Algorithm     Cheng's [13]   Bousselmi's [11]   Our proposed
Multipliers   9              7                  4
Adders        21             31                 30
Fig. 3-7 shows the read/write operation of the DCT/IDCT transpose memory [14]. In the row-column decomposition of the 2-D DCT/IDCT, the coefficients of the first 1-D DCT/IDCT are written into the transpose memory row by row, in address order (0, 1, 2, 3, 4, 5, 6, 7, 8, …). Once the coefficient at address 49 has been written, reading of the first column can begin. The data in the transpose memory are then read column by column (0, 8, 16, 24, 32, 40, 48, 56, 1, …). As shown in Fig. 3-7(b), the data written to address 56 are ready to be read after 8 cycles. In the next time slice, the new coefficients of the 1-D DCT/IDCT are written column by column and read row by row. With this alternating scheme, the transpose memory can be read and written at the same time.
Figure 3-7 Read/write action of DCT/IDCT transpose memory
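The alternating address scheme described above can be sketched in Python as a cycle-by-cycle simulation of a single 64-word memory. The sketch assumes one write and one read per cycle, with the read port starting 50 cycles after the write port, that is, in the cycle after address 49 has been written. The function names and the `delay` parameter are hypothetical, and the real memory timing may differ.

```python
N = 8
SIZE = N * N

def addr(i, column_major):
    """Address of the i-th access in a block, in row order or column order."""
    r, c = divmod(i, N)
    return c * N + r if column_major else i

def stream_transpose(samples, delay=50):
    """Stream whole 8x8 blocks through one memory and return the read-out stream."""
    n_blocks = len(samples) // SIZE
    mem = [None] * SIZE
    out = []
    for t in range(n_blocks * SIZE + delay):
        # Write port: even blocks are written row by row, odd blocks column by column.
        if t < n_blocks * SIZE:
            b, i = divmod(t, SIZE)
            mem[addr(i, b % 2 == 1)] = samples[t]
        # Read port: lags by `delay` cycles and uses the opposite order of the
        # write that filled the block, so each block comes out transposed.
        if t >= delay:
            b, i = divmod(t - delay, SIZE)
            out.append(mem[addr(i, b % 2 == 0)])
    return out

def transpose_blocks(samples):
    """Reference result: every 8x8 block transposed independently."""
    ref = []
    for base in range(0, len(samples), SIZE):
        ref.extend(samples[base + c * N + r] for r in range(N) for c in range(N))
    return ref

data = list(range(3 * SIZE))          # three consecutive blocks
streamed = stream_transpose(data)     # read and write overlap in time
```

In this model each new block is written into the locations that the read port has just released, so one memory suffices for continuous operation. Starting the read much earlier, for example with `delay=40`, reads locations that have not been written yet.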
A multiplier-adder based 2-D DCT/IDCT architecture is shown in Fig. 3-8. This architecture has five quantization error sources: 1. The quantization of the coefficients for the row-wise and the column-wise transforms (Coeff1 and Coeff2). 2. The wordlength reduction at the outputs of the first and the second multipliers (Adder1 and Adder2, listed as Acc1 and Acc2 in Table 3-3). 3. The output of the limiter for the row-wise transform (1D_Out). The most suitable way to decide the minimum wordlengths of Coeff1, Coeff2, and 1D_Out is to compute the overall mean square error. The peak mean error and the overall mean error are the important measures for determining the minimum wordlengths of Adder1 and Adder2 [15]. The optimum wordlengths of these five terms in our work are shown in Table 3-3. The IDCT precision of this module meets the IEEE IDCT precision standard (IEEE Std 1180-1990), and the implementation results are presented in Chapter 4.
Figure 3-8 Block diagram of a multiplier-adder based 2-D DCT/IDCT
Table 3-3 Optimized wordlengths for our 2-D DCT/IDCT architecture
Signal    Optimized wordlength (bits)
Coeff1    13
Acc1      21
1D_Out    16
Coeff2    12
Acc2      20
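To illustrate how a mean square error can guide the choice of a coefficient wordlength, the following Python sketch quantizes the transform matrix to a given number of fractional bits and measures the error of the 2-D IDCT against a floating-point reference. It is only a rough model under stated assumptions: A is taken to be the standard orthonormal 8-point DCT-II matrix, only coefficient quantization is modeled (not the accumulator or 1D_Out wordlengths), the inputs are uniformly random 12-bit values, and `frac_bits` counts fractional bits only, so it does not map directly onto the wordlengths in Table 3-3. It is not the IEEE Std 1180-1990 test procedure.

```python
import math
import random

N = 8

def dct_matrix():
    """Assumed transform matrix: A[k][n] = c(k)/2 * cos((2n+1)*k*pi/16)."""
    return [[(math.sqrt(0.5) if k == 0 else 1.0) * 0.5
             * math.cos((2 * n + 1) * k * math.pi / 16)
             for n in range(N)] for k in range(N)]

def quantize(a, frac_bits):
    """Round every coefficient to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return [[round(v * scale) / scale for v in row] for row in a]

def idct_2d(a, z):
    """X = A^T Z A, evaluated element by element."""
    return [[sum(a[k][i] * z[k][l] * a[l][j]
                 for k in range(N) for l in range(N))
             for j in range(N)] for i in range(N)]

def idct_mse(frac_bits, n_blocks=20, seed=1):
    """Mean square error caused by coefficient quantization alone."""
    rng = random.Random(seed)          # fixed seed keeps the experiment repeatable
    a_ref = dct_matrix()
    a_q = quantize(a_ref, frac_bits)
    total = 0.0
    for _ in range(n_blocks):
        z = [[rng.randint(-2048, 2047) for _ in range(N)] for _ in range(N)]
        ref = idct_2d(a_ref, z)
        approx = idct_2d(a_q, z)
        total += sum((ref[i][j] - approx[i][j]) ** 2
                     for i in range(N) for j in range(N))
    return total / (n_blocks * N * N)

mse_by_bits = {bits: idct_mse(bits) for bits in (6, 9, 12)}
```

Sweeping `frac_bits` in this way shows the error falling as the coefficient wordlength grows; the smallest wordlength whose error stays within the required precision would be the candidate.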
To reduce the hardware cost of the DCT/IDCT architecture, rounding is required after a multiplication. In our architecture, rounding is used to shorten the output data of each 1-D DCT/IDCT module. We adopt the true rounding method to improve the IDCT precision. For an n x n multiplication, true rounding adds a 1 at the nth least significant bit of the product and then truncates the n least significant bits of the sum. This process
Figure 3-9 Rounded Multiplication Dot Diagram
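The true rounding rule stated above, adding a 1 at the nth least significant bit and then dropping the n least significant bits, can be written in a few lines of Python and compared with plain truncation. This is a minimal integer sketch; the function names are illustrative, and the products are modeled as Python integers rather than fixed-width hardware words.

```python
def truncate(product, n):
    """Drop the n least significant bits (rounds toward minus infinity)."""
    return product >> n

def true_round(product, n):
    """Add 1 at the n-th least significant bit, then drop the n least significant bits."""
    return (product + (1 << (n - 1))) >> n

def mean_error(method, n):
    """Average of (result - exact quotient) over all 2**n discarded bit patterns."""
    count = 1 << n
    return sum(method(p, n) - p / count for p in range(count)) / count

bias_truncate = mean_error(truncate, 4)
bias_round = mean_error(true_round, 4)
```

For example, with n = 2 the product 14 (exactly 3.5 after scaling) truncates to 3 but rounds to 4. Averaged over all discarded bit patterns, true rounding leaves a much smaller bias than truncation, which is why it helps the IDCT precision.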
The core characteristics of the DCT/IDCT architecture are summarized in Table 3-4.
Table 3-4 Core characteristics of the DCT/IDCT architecture
Inputs                9 bits (DCT), 12 bits (IDCT)
Outputs               12 bits (DCT), 9 bits (IDCT)
Internal wordlength   16 bits
Technology            0.18-µm CMOS
No. of transistors    168,244
Clock rate            70 MHz
Mode selection        DCT or IDCT
Block size            8 x 8
Accuracy              IEEE Std 1180-1990