Organization of this thesis - Design of Inverse Discrete Cosine Transform for Multiple Video Standard Applications

Chapter 1 Introduction

1.2 Organization of this thesis

This thesis is organized as follows. Chapter 2 presents the IDCT algorithm: it covers the IDCT basics and previous work, and then describes the algorithm used in this thesis. Chapter 3 presents the proposed IDCT architecture, including the matrix calculation algorithm, the architecture itself, and its optimization. Chapter 4 shows the verification method and the simulation results. The last chapter gives a brief conclusion and discusses future work.

Chapter 2 Overview of Inverse Discrete Cosine Transform (IDCT)

Since its introduction in 1974 [1], the DCT/IDCT has been widely used in video compression applications and standards, where the goal of compression is to reduce redundancy. Many different algorithms have been proposed. They can be briefly classified as follows: direct computation of the DCT, fast cosine transforms, memory-based designs [2~3], and adder-based designs [4]. In this chapter, we first introduce the definition of the DCT/IDCT and other methods, and then present our algorithm and the method used for the different matrix types.

2.1 IDCT in Decoder for Different Standards

Fig.2.1 shows where the IDCT sits in an MPEG-2 decoder, and Fig.2.2 shows where it sits in an H.264 decoder.

[Block diagram: Input Bit-Stream, Run-Length Decoder, Inverse Quantization, Inverse DCT, Motion Compensation, Frame Stores, Output Video]

Fig.2.1 IDCT in MPEG-2 Decoder

[Block diagram: Input Bit-Stream, Entropy Decoding, Inverse Quantization, Inverse DCT, Motion Compensation, Intra Frame Prediction, Intra/Inter selection, De-blocking Filter, Frame Storage, Output Video]

Fig.2.2 IDCT in H.264 Decoder

2.2 Overview of MPEG Compression Algorithm

The Moving Picture Experts Group (MPEG) standards specify two techniques for video compression. First, block-based motion compensation reduces temporal redundancy. Second, DCT-based compression reduces redundancy in the spatial domain.

2.2.1 Temporal Redundancy Reduction

The MPEG standard defines three types of pictures for motion compensation: the intra coded picture (I-Picture), the predictive coded picture (P-Picture), and the bi-directional predictive picture (B-Picture). Brief descriptions are listed below, and the picture structure is shown in Fig.2.3.

I-Picture: An intra coded picture is coded without reference to other pictures, and it provides access points for random access. I-Pictures offer only moderate compression.

P-Picture: A predictive coded picture uses a past I-Picture or P-Picture for motion compensation. The compression efficiency of a P-Picture is better than that of an I-Picture; according to [5], an I-Picture is three times as large as a P-Picture.

B-Picture: A bi-directional predictive picture uses past and future I-Pictures or P-Pictures for motion compensation. Its compression efficiency is the highest of the three picture types.

[Diagram: picture sequence I B B P B B along the time axis, with forward prediction and bidirectional prediction arrows]

Fig.2.3 The Temporal Picture Structure

2.2.2 Spatial Redundancy Reduction

Both the still-image and the prediction-error signals have a very high degree of spatial redundancy. Many redundancy reduction techniques could be used here, but because the motion compensation process is block based, block-based techniques are preferred. A frame is first divided into 8×8 blocks of pixels, and the two-dimensional DCT is then applied independently to each block. This operation results in an 8×8 block of DCT coefficients in which most of the energy of the original block is typically concentrated in a few low-frequency coefficients. A quantizer is then applied to each DCT coefficient, which sets many of them to zero. This quantization is responsible for the lossy nature of the compression. Compression is achieved by transmitting only the coefficients that survive the quantization operation and by entropy coding their locations and amplitudes.

2.2.3 The Process of Decoding for MPEG-2

The MPEG-2 standard [6] defines the decoding process rather than the decoder itself, so designers and manufacturers can develop their own architectures and apply different algorithms to realize this decoding process. The decoding process defined in the MPEG-2 standard is shown in Fig.2.4. The input data is first variable-length decoded. Since the data from the VLD was zig-zag scanned (Fig.2.5) by the encoder, the inverse scan module reconstructs the one-dimensional data stream into a two-dimensional matrix. This two-dimensional matrix is then inverse quantized to obtain the DCT coefficients. Note that intra and inter block data need different inverse quantization processes. The IDCT module transforms the coefficients into image data.

The motion compensation module processes these image data together with the motion vectors from the VLD to form the decoded data. After proper filtering and transformation, the image data are sent to the monitor or television for display.

[Block diagram: Coded Data, Variable Length Decoding, Inverse Scan, Inverse Quantization, IDCT, Motion Compensation, Frame-Store Memory, Decoded Data]

Fig.2.4 MPEG-2 Decoding Process

Fig.2.5 Zig-zag Scan Order
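The inverse scan step described above can be sketched in a few lines. This is a minimal illustration that assumes the conventional 8×8 zig-zag order (anti-diagonals walked in alternating directions, starting at the DC coefficient); the function names `zigzag_order` and `inverse_scan` are hypothetical and are not taken from the standard or from this thesis.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):                      # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        for r in (reversed(rows) if s % 2 == 0 else rows):
            order.append((r, s - r))
    return order


def inverse_scan(coeffs, n=8):
    """Rebuild the n x n coefficient block from a 1-D zig-zag scanned list."""
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(coeffs, zigzag_order(n)):
        block[r][c] = value
    return block
```

Calling `inverse_scan(list(range(64)))` places each value at the position where it appears in the scan, which makes the order easy to inspect.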

2.3 Algorithms of Inverse Discrete Cosine Transform

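The N×N 2-D DCT is defined as follows:

$$F(u,v) = \frac{2}{N}\, c(u)\, c(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}$$

The N×N 2-D IDCT is defined as follows:

$$f(x,y) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} c(u)\, c(v)\, F(u,v) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}$$

where $c(0) = 1/\sqrt{2}$ and $c(u) = 1$ for $u \neq 0$.

Computed directly, the IDCT of an 8×8 block needs 8⁴ = 4096 multiplications and a comparable number of additions. It is not feasible to implement the IDCT with so many multiplications and additions, because it is very expensive. Many algorithms have therefore been proposed to reduce the number of multiplications and additions, especially the multiplications, which need a large chip area and a long computation time.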
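To make the cost of direct computation concrete, the sketch below evaluates the 2-D IDCT as a plain four-level loop and counts one multiplication per coefficient-kernel product. It assumes the common normalization with a 2/N factor and c(0) = 1/√2, and it treats the combined scale-and-cosine kernel as a precomputed constant; `idct2_direct` is a hypothetical name used only for illustration.

```python
import math


def idct2_direct(F):
    """Direct N x N 2-D IDCT. Returns (pixels, count of coefficient-kernel products)."""
    N = len(F)
    scale = [1 / math.sqrt(2) if u == 0 else 1.0 for u in range(N)]
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for u in range(N)]
           for x in range(N)]
    mults = 0
    f = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            acc = 0.0
            for u in range(N):
                for v in range(N):
                    # The kernel depends only on (x, y, u, v); in hardware it would be a constant.
                    kernel = (2 / N) * scale[u] * scale[v] * cos[x][u] * cos[y][v]
                    acc += F[u][v] * kernel
                    mults += 1
            f[x][y] = acc
    return f, mults
```

For an 8×8 block the counter reaches 8⁴, which is the figure that the fast algorithms discussed in this chapter try to reduce.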

2.3.1 Direct Computation of Two Dimension DCT

The algorithm to compute the DCT directly was proposed by Chen, Smith and Fralick [7] and is explained below.

The algorithm writes the DCT of an N×1 vector in matrix form and factors the coefficient matrix into sparse matrices. Fig.2.6 shows the signal flow graph of this sparse-matrix direct computation algorithm for N = 8.

Fig.2.6 Signal Flow Graph of Direct Computation Algorithm for N=8

The algorithm requires

$$N \log_2 N - \frac{3N}{2} + 4 \quad \text{real multiplications} \tag{2.6}$$

If $N = 8$, this algorithm needs 16 multiplications. Given the cost-down target, this is still too many multiplications for implementation.

2.3.2 Parallel Implementations

Cho and Lee [9] introduced a DCT algorithm that can be executed on a modified DFT architecture with $(N+1)$ PE's. The relationship developed in this algorithm will be used later in the derivation of the prime factor DCT algorithm.

Bayoumi et al. [10] proposed a systolic array for computing the DFT based on the RNS (residue number system). Fig.2.7 depicts the architecture for a 5-point DFT. From Fig.2.7, one can see that $(N-1)$ basic PE's are required to compute an N-point DFT. Each PE of the array performs the function shown in Fig.2.8.

Fig.2.7 The Example of Bayoumi’s [10] Architecture for 5 Point DFT

An N-point DCT can also be executed on this systolic array. Let the input data sequence be $\{X(n),\ n = 0, 1, \ldots, N-1\}$, and let the DFT of $\{X(n)\}$ be defined as in EQ.2.7, where the scale factor $1/N$ is skipped for convenience. The architecture proposed in [10] computes this DFT, but it requires the input in reverse order. In most cases, natural order input is preferred, since otherwise unnecessary delay and memory are required. To obtain an architecture for natural order input, a simple modification is needed. Let us denote by $A(k)$ the output when the input data sequence to this architecture is in natural order and the kernel input is conjugated, i.e., the kernel input is $\{W_N^{0}, W_N^{-1}, W_N^{-2}, \ldots, W_N^{-(N-1)}\}$.

Thus, by connecting one additional basic PE to the N-point systolic array, as shown in Fig.2.9, we can obtain $W_N^{-k} A(k)$ at the output. Fig.2.9 depicts the modified DFT architecture for natural order input. We now focus on the N-point DCT, which can be executed on this modified DFT architecture with the input in natural order. The DCT relationship is given in EQ.2.10. Here we also neglect the scale factor $1/N$ until the end for clarity.

Instead of computing EQ.2.10 directly, one can see that the DCT can also be obtained by the indirect computations of EQ.2.11 and EQ.2.12.

On the systolic architecture of Fig.2.9, if we input $\{U_N^{0}, U_N^{-1}, U_N^{-2}, \ldots, U_N^{-(N-1)}\}$ instead of $\{W_N^{0}, W_N^{-1}, W_N^{-2}, \ldots, W_N^{-(N-1)}\}$, then the output at the right end of the architecture is equivalent to

$$B(k) = U_N^{-Nk}\, T(k) \tag{2.14}$$

Since $U_N^{-Nk} = (-1)^k$, the output of the system is equal to $T(k)$ if k is even, and it is equal to $-T(k)$ if k is odd.

Thus, if we connect a processor that multiplies $B(k)$ by $(-1)^k e^{-jk\pi/2N} c(k)$, then the DCT is finally obtained.

Fig.2.9 The Modified Systolic Architecture for 5 Point DFT

The architecture for the case of a 5-point DCT is shown in Fig.2.10. The total number of PE's required for an N-point DCT is therefore $(N+1)$. If one prefers to arrange the input data in reverse order, the number of PE's for an N-point DCT is N. Since the DST (discrete sine transform) is defined similarly to the DCT, the DST can also be obtained on this architecture with slight modifications.
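The indirect route from a DFT-style accumulation to the DCT can be checked numerically. The sketch below is only a model under stated assumptions: it takes $U_N = e^{-j\pi/N}$, models the chain of PE's as a Horner-style recurrence fed with natural-order input and the conjugated kernel, and omits the scale factors $c(k)$ and $1/N$. The names `dct_reference` and `dct_via_dft_array` are hypothetical.

```python
import cmath
import math


def dct_reference(x):
    """Unscaled DCT: sum over n of x(n) * cos((2n+1) k pi / 2N)."""
    N = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
            for k in range(N)]


def dct_via_dft_array(x):
    """Same DCT obtained from a DFT-like accumulation plus one post-multiplication."""
    N = len(x)
    U = cmath.exp(-1j * math.pi / N)          # assumed kernel base U_N
    result = []
    for k in range(N):
        kernel = U ** (-k)                    # conjugated kernel value for output k
        acc = 0j
        for sample in x:                      # natural-order input, one step per PE
            acc = (acc + sample) * kernel
        # Here acc equals U^(-N k) * T(k), i.e. (-1)^k * T(k).
        post = (-1) ** k * cmath.exp(-1j * k * math.pi / (2 * N))
        result.append((post * acc).real)
    return result
```

The recurrence applies the kernel N times to the first sample, which is where the factor $U_N^{-Nk}$ in front of $T(k)$ comes from; the post-multiplier removes its sign and supplies the half-sample phase shift.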

2.4 Paper Reference

This section introduces the DCT/IDCT architectures and methods of state-of-the-art works, lists their main methods, and shows their architecture diagrams.

D. W. Kim's proposal [14] uses a hardwired DA method and radix-2 multi-bit coding. Fig.2.11 shows the processing element for the IDCT even and odd matrices.

Fig.2.11 Processing Element for IDCT Even and Odd Matrix from D.W Kim [14]

A. Madisetti's design [15] uses hardware multiplications and a signed digit representation (12-bit cosine coefficients). Fig.2.12 shows this architecture.

[Block diagram: input X, DRU, BDEG Matrix-Vector Multiplier, ACF Matrix-Vector Multiplier, Transpose Memory, IDRU, output Z]

Fig.2.12 A. Madisetti’s [15] Process Architecture

The chip architecture consists of a data reorder unit (DRU), two matrix-vector multiplier units, an inverse data reorder unit (IDRU), and a transpose memory. Fig.2.13 to Fig.2.15 show the details of the architecture in [15].

Fig.2.14 ACF Matrix-Vector Multiply Unit from [15]

Fig.2.15 BDEG Matrix-Vector Multiply Unit from [15]

J. I. Guo's design [16] uses hardwired multiplications, cyclic convolution, a signed digit representation (14-bit cosine coefficients), and common sub-expression sharing. The architecture is shown in Fig.2.16.

Fig.2.16 J.I Guo’s [16] Architecture

2.5 The Proposed Algorithm

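From [8], the algorithm is described as follows. The N-point IDCT is

$$x(k) = \sum_{n=0}^{N-1} \hat{X}(n) \cos\frac{(2k+1)n\pi}{2N}, \quad k = 0, 1, \ldots, N-1$$

where $\hat{X}(n)$ is the scaled DCT coefficient. Let $C_{2N}^{(2k+1)n} = \cos\frac{(2k+1)n\pi}{2N}$, and the IDCT can be expressed as

$$x(k) = \sum_{n=0}^{N-1} \hat{X}(n)\, C_{2N}^{(2k+1)n}$$

If N is even, $x(k)$ separates into even-indexed and odd-indexed terms. The formulas are listed below:

$$g(k) = \sum_{n=0}^{N/2-1} \hat{X}(2n)\, C_{2(N/2)}^{(2k+1)n}, \qquad h'(k) = \sum_{n=0}^{N/2-1} \hat{X}(2n+1)\, C_{2N}^{(2k+1)(2n+1)}$$

$$x(k) = g(k) + h'(k), \qquad x(N-1-k) = g(k) - h'(k), \quad k = 0, 1, \ldots, N/2-1$$

From EQ.2.22 and EQ.2.23 the equation becomes

$$2\, C_{2N}^{(2k+1)}\, h'(k) = \sum_{n=0}^{N/2-1} \left[ \hat{X}(2n+1) + \hat{X}(2n-1) \right] C_{2(N/2)}^{(2k+1)n} = h(k), \qquad \hat{X}(-1) = 0$$

Finally the formulas are

$$x(k) = g(k) + \frac{1}{2\, C_{2N}^{(2k+1)}}\, h(k), \qquad x(N-1-k) = g(k) - \frac{1}{2\, C_{2N}^{(2k+1)}}\, h(k)$$

From the above process, we can use two $N/2$-point IDCTs to calculate an N-point IDCT. Fig.2.17 is the 8-point IDCT flow graph.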
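The statement that two N/2-point IDCTs yield an N-point IDCT can be checked with a short recursive sketch. It assumes the unscaled form $x(k) = \sum_n X(n)\cos\frac{(2k+1)n\pi}{2N}$ with any scale factors already folded into the input, and it uses the even/odd recombination with the factor $1/(2\cos\frac{(2k+1)\pi}{2N})$. The names `idct_direct` and `idct_split` are hypothetical and are not claimed to match the notation of [8].

```python
import math


def idct_direct(X):
    """Reference: x(k) = sum_n X(n) cos((2k+1) n pi / 2N), scaling folded into X."""
    N = len(X)
    return [sum(X[n] * math.cos((2 * k + 1) * n * math.pi / (2 * N)) for n in range(N))
            for k in range(N)]


def idct_split(X):
    """Same transform computed from two half-length transforms (N a power of two)."""
    N = len(X)
    if N == 1:
        return [X[0]]
    G = [X[2 * n] for n in range(N // 2)]                      # even-indexed inputs
    H = [X[2 * n + 1] + (X[2 * n - 1] if n > 0 else 0.0)       # odd inputs, pairwise summed
         for n in range(N // 2)]
    g = idct_split(G)
    h = idct_split(H)
    x = [0.0] * N
    for k in range(N // 2):
        t = h[k] / (2 * math.cos((2 * k + 1) * math.pi / (2 * N)))
        x[k] = g[k] + t                                        # first half of the outputs
        x[N - 1 - k] = g[k] - t                                # mirrored second half
    return x
```

The only multiplications left at each level are the N/2 divisions by the cosine constants, which in hardware would be multiplications by fixed reciprocals.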

The one-dimensional 8-point IDCT formula is given in EQ.2.31, and it is expressed as the 8×8 matrix shown in Fig.2.18. The coefficients are listed below.

Because the proposed architecture supports different standards, and these standards use both 8×8 and 4×4 matrices, a single architecture has to handle both matrix sizes. To do so, an 8×8 matrix is mapped onto the 4×4 architecture; this mapping (the transport) is shown in Fig.2.19.

Fig.2.19 The Transport for 8×8 Matrix

From Fig.2.19, because the coefficient matrix is symmetrical, the computation can be separated into two matrix-vector products (an 8×8 matrix into two 4×4 matrices), as shown in Fig.2.20.
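The separation of one 8×8 product into two 4×4 products follows from the symmetry of the cosine kernel: the even-indexed coefficients contribute equally to outputs k and 7−k, and the odd-indexed ones contribute with opposite sign. The sketch below illustrates this for the 1-D 8-point IDCT, assuming the orthonormal scaling $\frac{1}{2}c(n)$; the names `M`, `EVEN`, `ODD` and `idct8_split` are hypothetical, and the layout of Fig.2.20 may differ.

```python
import math

N = 8


def c(n):
    return 1 / math.sqrt(2) if n == 0 else 1.0


# Full 8x8 IDCT matrix: M[k][n] = (1/2) c(n) cos((2k+1) n pi / 16)
M = [[0.5 * c(n) * math.cos((2 * k + 1) * n * math.pi / 16) for n in range(N)]
     for k in range(N)]
# Two 4x4 sub-matrices taken from the first four rows only
EVEN = [[M[k][2 * n] for n in range(4)] for k in range(4)]
ODD = [[M[k][2 * n + 1] for n in range(4)] for k in range(4)]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def idct8_split(X):
    """8-point IDCT from two 4x4 matrix-vector products plus additions and subtractions."""
    e = matvec(EVEN, X[0::2])       # even-indexed coefficients
    o = matvec(ODD, X[1::2])        # odd-indexed coefficients
    first_half = [e[k] + o[k] for k in range(4)]             # outputs 0..3
    second_half = [e[3 - k] - o[3 - k] for k in range(4)]    # outputs 4..7
    return first_half + second_half
```

This halves the number of multiplications relative to the full 8×8 product (32 instead of 64), at the cost of eight extra additions or subtractions.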

2.5.2 IDCT of H.264 High Profile

The 8×8 matrix of the H.264 high profile IDCT is shown in Fig.2.21. To separate it into two 4×4 matrices in the same way as the 8×8 matrix of the MPEG-2 IDCT, columns of the multiplicand matrix and rows of the multiplier matrix need to be exchanged. Specifically, columns #5 and #8 and columns #6 and #7 of the multiplicand matrix, and rows #5 and #8 and rows #6 and #7 of the multiplier matrix, are exchanged. The result of the exchange is shown in Fig.2.22, and the separated matrices are shown in Fig.2.23.

Fig.2.21 The Original 8×8 Matrix of H.264 High Profile IDCT

Fig.2.22 The Modified Columns #5~#8 in the Multiplicand Matrix and Rows #5~#8 in the Multiplier Matrix of H.264 High Profile IDCT

2.5.3 IDCT of H.264 Baseline

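The H.264 baseline IDCT is a 4×4 matrix, as shown in Fig.2.24, and it can be calculated directly by the 4×4 architecture. In other words, it does not need any modification.

$$\begin{bmatrix} x_{00} & x_{01} & x_{02} & x_{03} \\ x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \end{bmatrix} = \begin{bmatrix} 2 & 2 & 2 & 1 \\ 2 & 1 & -2 & -2 \\ 2 & -1 & -2 & 2 \\ 2 & -2 & 2 & -1 \end{bmatrix} \begin{bmatrix} X_{00} & X_{01} & X_{02} & X_{03} \\ X_{10} & X_{11} & X_{12} & X_{13} \\ X_{20} & X_{21} & X_{22} & X_{23} \\ X_{30} & X_{31} & X_{32} & X_{33} \end{bmatrix}$$

Fig.2.24 The 4×4 Matrix of H.264 Baseline IDCT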

Chapter 3 Architecture Overview

In this chapter, we propose the IDCT architecture and its input and output interfaces. For the matrix multiplication block, we introduce the one dimension systolic array and the rearranged input data flow that lets it perform matrix multiplication. To meet the system requirement and to replace the multipliers, we also introduce several methods to optimize this architecture.

This chapter is organized as follows. Section 3.1 introduces the proposed prototype architecture. Section 3.2 gives an overview of the one dimension systolic array and of the modified data flow used to calculate the matrix multiplication. Section 3.3 considers the system requirement and motion compensation, modifies the prototype architecture to meet the system requirement, describes the method used to reduce the number of additions, and then gives the new architecture.

3.1 Proposed System Architecture of IDCT

The IDCT operating procedure is shown in Fig.3.1. It consists of two individual one-dimensional IDCTs and one transpose process, which is used to combine the two one-dimensional IDCTs.

[Block diagram: DCT Coeff. Input, 1-D IDCT, Transpose, 1-D IDCT, Output Data (2-D IDCT Results)]

Fig.3.1 IDCT Operating Procedure
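The row-column procedure of Fig.3.1 can be sketched as follows. This is an illustration only: it assumes the orthonormal 1-D IDCT scaling ($\sqrt{1/N}$ for the DC term, $\sqrt{2/N}$ otherwise) and plain floating-point arithmetic, and the names `idct_1d`, `transpose` and `idct_2d` are hypothetical.

```python
import math


def idct_1d(X):
    """Orthonormal 1-D IDCT of one row."""
    N = len(X)
    scale = [math.sqrt(1 / N) if n == 0 else math.sqrt(2 / N) for n in range(N)]
    return [sum(scale[n] * X[n] * math.cos((2 * k + 1) * n * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def idct_2d(F):
    """2-D IDCT as: 1-D IDCT on every row, transpose, 1-D IDCT on every row, transpose back."""
    first_pass = [idct_1d(row) for row in F]
    second_pass = [idct_1d(row) for row in transpose(first_pass)]
    return transpose(second_pass)
```

The same 1-D routine is used for both passes, which is why a single 1-D datapath plus a transpose stage is enough for the 2-D transform. The sketch works for both 8×8 and 4×4 blocks.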

Regarding the input interface from inverse quantization, an input buffer is inserted after the inverse quantization block. It stores the DCT coefficients produced by inverse quantization. In the 4×4 matrix calculation case, this buffer transfers four 16-bit coefficients to the IDCT. Another four 16-bit coefficient outputs are controlled by the MODE control signal of the IDCT block and are enabled in the 8×8 matrix calculation. The input interface is shown in Fig.3.2.

[Block diagram: Inverse Quantization, Buffer with two 16-bit outputs (the second selected by MODE for 8x8), Inverse DCT]

Fig.3.2 IDCT Block Input Interface

The block diagram of the prototype IDCT architecture is shown in Fig.3.3. Four 16-bit coefficients output from inverse quantization form the input of the IDCT. In the 8×8 case, they are divided into two 4×4 matrix blocks.

Because this architecture supports three video standards, namely MPEG-2, H.264 high profile, and H.264 baseline, the coefficient matrix (multiplicand matrix) is selected by a two-bit signal named “MODE”. The selected MODE outputs the coefficient matrix to be used in the matrix multiplication, and the addition and subtraction processes are then executed to obtain the 4×4 matrix multiplication results. Finally, the four 4×4 results obtained are combined to generate the 8×8 matrix. The matrix multiplication block is introduced in the next section.

[Block diagram: Mode Selection (MPEG-2, H.264 high profile (8x8), H.264 baseline (4x4)), two 4x4 systolic arrays, Transpose]

Fig.3.3 Block Diagram of Prototype IDCT Architecture

3.2 Overview of One Dimension Systolic Array

For the matrix multiplication, a one dimension systolic array is used after its input data flow has been modified. A one dimension systolic array has the characteristics of “regularity” and “inherent parallelism” in its input and output data flow. It also makes the architecture design and layout easier. Because the IDCT result must be added to the motion compensation result, and motion compensation is always the bottleneck of the multimedia calculation process, the IDCT design can sacrifice some throughput to reduce cost. The one dimension systolic array is therefore adopted. In [11], a one dimension systolic array is used for motion estimation; the data flow of the motion estimation is shown in Fig.3.4.

[Data flow diagram: Reference Data and Search Data streams, A.D, A and M units, Displacement Vector output]

Fig.3.4 One-D Systolic Array Data Flow for Motion Estimation [11]

As Fig.3.5 shows, the multiplicand and multiplier data flows can be modified to perform the matrix multiplication with four processing elements (PE's). Processing element #1 is a multiplier, and processing elements #2~#4 each include a multiplier and an adder. An example of the data flow is also given in Fig.3.5.

[Data flow table: data locations of multiplier and multiplicand over T = 1, 2, 3, 4, ...; PE1 computes 11*11, 21*12, 31*13, ...]

Fig.3.5 One-D Systolic Array Data Flow for 4×4 Matrix Multiplication

The estimated cycle count of the one dimension systolic array for performing a 4×4 matrix multiplication is $N + (N \times N) = 20$ cycles ($N = 4$). The architecture is shown in Fig.3.6. The addition and subtraction processes need 2 cycles; therefore the two-D 8×8 matrix in the worst case of MPEG-2 needs $2 \times \{2 \times [N + (N \times N) + 1]\} = 84$ cycles. However, motion compensation and the system requirement allow fewer than 400 cycles per macroblock in the 4:2:0 format. A 4:2:0 macroblock contains six 8×8 blocks, so 84 cycles per block amounts to 6 × 84 = 504 cycles per macroblock, which is too slow for the system requirement and for the rate at which motion compensation produces its results.

Fig.3.6 One-D Systolic Array Architecture for Matrix Multiplication

Regarding the throughput issue, the one dimension systolic array offers the flexibility required to solve this problem. More hardware can be added to increase the throughput, so that the worst case of the IDCT meets the system and motion compensation requirements. The estimates for the four-PE prototype are listed in Table 3.1.

Table 3.1 Estimation Worst Cycle Counts

Standard   Size                                  Cycle Count (worst case), four PE's
MPEG-2     4x4 Matrix                            N+N²
           2-D 8x8 Matrix (for 4:2:0 system)     2x{2x[(N+N²+1)]}

If more hardware is added to the architecture, the input data flow needs to be adjusted. The adjusted data flow is shown in Fig.3.7.

cycle #: 1 2 3 4 5 6 7 8 9 10 11 12

Multiplicand inputs (one row per PE stage, PE1 to PE4):
a11 a21 a31 a41 a11 a21 a31 a41
a12 a22 a32 a42 a12 a22 a32 a42
a13 a23 a33 a43 a13 a23 a33 a43
a14 a24 a34 a44 a14 a24 a34 a44

Multiplier inputs (two rows per PE stage, for PEx-1 and PEx-2):
i11 i12 i13 i14 i13 i14 i11 i12
i12 i13 i14 i11 i14 i11 i12 i13
i21 i22 i23 i24 i23 i24 i21 i22
i22 i23 i24 i21 i24 i21 i22 i23
i31 i32 i33 i34 i33 i34 i31 i32
i32 i33 i34 i31 i34 i31 i32 i33
i41 i42 i43 i44 i43 i44 i41 i42
i42 i43 i44 i41 i44 i41 i42 i43

Products (each later PE adds its own product to the partial sum passed on by the previous PE):
PE1-1:      a11*i11 a21*i12 a31*i13 a41*i14 a11*i13 a21*i14 a31*i11 a41*i12
PE1-2:      a11*i12 a21*i13 a31*i14 a41*i11 a11*i14 a21*i11 a31*i12 a41*i13
PE2-1 adds: a12*i21 a22*i22 a32*i23 a42*i24 a12*i23 a22*i24 a32*i21 a42*i22
PE2-2 adds: a12*i22 a22*i23 a32*i24 a42*i21 a12*i24 a22*i21 a32*i22 a42*i23
PE3-1 adds: a13*i31 a23*i32 a33*i33 a43*i34 a13*i33 a23*i34 a33*i31 a43*i32
PE3-2 adds: a13*i32 a23*i33 a33*i34 a43*i31 a13*i34 a23*i31 a33*i32 a43*i33
PE4-1 adds: a14*i41 a24*i42 a34*i43 a44*i44 a14*i43 a24*i44 a34*i41 a44*i42
PE4-2 adds: a14*i42 a24*i43 a34*i44 a44*i41 a14*i44 a24*i41 a34*i42 a44*i43

Output values:
c11 c22 c33 c44 c13 c24 c31 c42
c12 c23 c34 c41 c14 c21 c32 c43

Fig.3.7 One-D Systolic Array Architecture Refinement for Matrix Multiplication
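The schedule of Fig.3.7 can be checked functionally: two chains of four PE's each produce eight results, and together they cover all sixteen entries of the 4×4 product. The sketch below reproduces only which operands meet in which PE; it does not model the cycle-by-cycle timing or pipeline skew, and the names `COLUMN_SCHEDULE` and `systolic_matmul` are hypothetical.

```python
# Column of the multiplier matrix seen by each chain at input slots 0..7 (0-based),
# read off the i-rows of the data flow: chain 1 sees i_x1 i_x2 i_x3 i_x4 i_x3 i_x4 i_x1 i_x2.
COLUMN_SCHEDULE = {
    1: [0, 1, 2, 3, 2, 3, 0, 1],   # PEx-1 chain
    2: [1, 2, 3, 0, 3, 0, 1, 2],   # PEx-2 chain
}


def systolic_matmul(A, I):
    """Compute C = A * I (4x4) following the operand pairing of the refined data flow."""
    C = [[None] * 4 for _ in range(4)]
    for chain, columns in COLUMN_SCHEDULE.items():
        for slot, col in enumerate(columns):
            row = slot % 4                      # multiplicand rows repeat: 1, 2, 3, 4, 1, 2, 3, 4
            acc = 0
            for stage in range(4):              # PE1..PE4 each contribute one product
                acc += A[row][stage] * I[stage][col]
            C[row][col] = acc
    return C
```

Every `(row, col)` pair is produced exactly once across the two chains, which is why doubling the PE's lets the array finish all sixteen results in eight input slots.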

In the 4×4 matrix calculation, the cycle count decreases to $N + 2N = 12$ cycles ($N = 4$), a reduction of 8 cycles. The 16 results of the two-D 4×4 matrix need $2 \times (N + 2N) = 24$ cycles, and the latency is $(N + 2N) + N = 16$ cycles, as illustrated in Fig.3.8. Because the architecture has two 4×4 matrix multiplication blocks, while the previous 4×4 matrix is in the transpose stage, the next 4×4 matrix can be calculated in the other matrix multiplication block, which improves the throughput.

[Timing diagram: latency of (N+2N)+N cycles, ending at the 16th cycle; operation cycles of 2x(N+2N) cycles, ending at the 24th cycle]

Fig.3.8 The Operation Cycles and Latency of Two-D 4×4 Matrix with More PE’s Added

The 64 results of the MPEG-2 two-D 8×8 matrix need $2 \times \{2 \times [(N + 2N) + 1]\} = 52$ cycles, and the latency is $2 \times [(N + 2N) + 1] + [(N + 2N) + 1] = 39$ cycles, at which point 32 results have been produced. The scheme is shown in Fig.3.9.

[Timing diagram: latency of 3x(N+2N+1) cycles, ending at the 39th cycle, produces 32 results; operation cycles of 2x2x(N+2N+1) cycles, ending at the 52nd cycle, produce the remaining 32 results]

Fig.3.9 The Operation Cycles and Latency of Two-D 8×8 Matrix added more PE’s
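The cycle estimates quoted in this section are simple functions of N, so they can be tabulated with a small helper. The formulas are the ones given in the text; the comparison against the 400-cycle budget assumes that the budget applies to a whole 4:2:0 macroblock of six 8×8 blocks processed one after another, which is an assumption of this sketch. The names `cycle_counts`, `BLOCKS_PER_MACROBLOCK` and `CYCLE_BUDGET` are hypothetical.

```python
def cycle_counts(n=4):
    """Estimated cycle counts for the four-PE prototype and the refined data flow."""
    base = n + n * n          # four PE's: one 4x4 matrix multiplication
    fast = n + 2 * n          # refined data flow with the additional PE's
    return {
        "base_4x4": base,
        "base_2d_8x8": 2 * (2 * (base + 1)),
        "fast_4x4": fast,
        "fast_2d_4x4": 2 * fast,
        "fast_2d_4x4_latency": fast + n,
        "fast_2d_8x8": 2 * (2 * (fast + 1)),
        "fast_2d_8x8_latency": 3 * (fast + 1),
    }


BLOCKS_PER_MACROBLOCK = 6     # assumption: 4 luminance + 2 chrominance blocks in 4:2:0
CYCLE_BUDGET = 400            # per-macroblock budget quoted in the text

counts = cycle_counts()
base_total = BLOCKS_PER_MACROBLOCK * counts["base_2d_8x8"]
fast_total = BLOCKS_PER_MACROBLOCK * counts["fast_2d_8x8"]
```

Under these assumptions the prototype exceeds the budget while the refined data flow stays within it, which is consistent with the conclusion drawn in the text.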

3.3 Refinement of Matrix Multiplication

The matrix multiplication of the prototype IDCT architecture uses multipliers to calculate the result. However, multipliers take up a large area on the chip, so a method is needed to reduce the required chip area by eliminating them. Section 3.3.1 introduces the canonical signed digit (CSD) and the modified canonical signed digit (Modified CSD) representations. Section 3.3.2 considers both the zero value skip and the CD value skip to optimize the architecture.

3.3.1 Canonical Signed Digit (CSD) and Modified Canonical Signed Digit (Modified CSD)

A constant multiplier can be implemented with shifters and adders driven by the two’s complement representation of the coefficient, or the coefficient can first be converted to CSD (Canonical Signed Digit) form. In general, the conversion of a two’s complement number $B = b_{n-1}, b_{n-2}, \ldots, b_0$ to the CSD form $D = d_{n-1}, d_{n-2}, \ldots, d_0$ can be described as in Fig.3.10. The benefit of CSD is that it minimizes the number of non-zero digits.

Start
i = 0, c_0 = 0, b_n = b_{n-1}
While i < n (Y):
    c_{i+1} = b_{i+1} b_i ∨ b_i c_i ∨ b_{i+1} c_i
    d_i = b_i + c_i − 2 c_{i+1}
    i = i + 1
End (N)

Fig.3.10 The CSD Conversion Algorithm
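The conversion loop of Fig.3.10 can be written out directly. The sketch below assumes the carry rule is the majority of $b_{i+1}$, $b_i$ and $c_i$ (computed here as an integer division) and returns the digits least significant first; `to_csd` is a hypothetical name.

```python
def to_csd(value, n):
    """Convert an n-bit two's complement integer to CSD digits in {-1, 0, 1}, LSB first."""
    b = [(value >> i) & 1 for i in range(n)]
    b.append(b[n - 1])                          # b_n = b_{n-1} (sign extension)
    c = 0                                       # c_0 = 0
    d = []
    for i in range(n):
        c_next = (b[i] + b[i + 1] + c) // 2     # majority of b_{i+1}, b_i, c_i
        d.append(b[i] + c - 2 * c_next)         # d_i = b_i + c_i - 2 c_{i+1}
        c = c_next
    return d
```

For example, `to_csd(23, 6)` yields the digits of 32 − 8 − 1, so the four non-zero bits of the binary form 10111 are replaced by three non-zero signed digits.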

The standard algorithm for converting two’s complement to the CSD representation does not distinguish the cost of an addition from that of a subtraction, i.e., it treats them as operations of equal cost. The modified CSD [12] uses a modified conversion algorithm in which a conversion to the −1 symbol (negative one, a subtraction) takes place only if the total number of operations (non-zero symbols) decreases.

Example results for the two’s complement, CSD and modified CSD representations from [12] are listed in Table 3.2. When the coefficient is three or eleven, the number of non-zero digits is the same for all three methods, but the CSD form contains subtractions while the modified CSD form does not contain any. When the coefficient is seven, the modified CSD form has one subtraction, but its number of non-zero digits is smaller than that of two’s complement. When the coefficient is twenty-three, the CSD form includes two subtractions, whereas the modified CSD form has only one subtraction, and its number of non-zero digits is also smaller than that of two’s complement.

Table 3.2 An Example of Results for the Standard CSD and Modified CSD Conversions [12]

Coeff.   Binary (Two's complement)   CSD (Canonical Signed Digit)              MCSD (Modified CSD)
         Value    components         Value            addition  subtraction    Value        addition  subtraction
3        11       2                  1 0 -1           1         1              1 1          2         0
7        111      3                  1 0 0 -1         1         1              1 0 0 -1     1         1
11       1011     3                  1 0 -1 0 -1      1         2              1 0 1 1      3         0
23       10111    4                  1 0 -1 0 0 -1    1         2              1 1 0 0 -1   2         1

In the proposed architecture, the MPEG-2 and H.264 coefficients use the CSD and modified CSD methods. In MPEG-2, the binary forms of cos(π/4) and cos(π/16) include five and seven +1’s, respectively. In CSD, they contain three and two +1’s, respectively, and two −1’s each. In modified CSD, the number of +1’s and −1’s of cos(π/4) is the same as in binary (two’s complement), while that of cos(π/16) is smaller than in binary (two’s complement). The details are listed in Table 3.3.

Table 3.3 The Binary, CSD, and Modified CSD Values of MPEG-2 Standard

Binary (Two's complement)

Coefficient   Value (14 bits)                      +1's   -1's   Total 1's
cos(π/4)      0 1 _ 0 1 1 0 _ 1 0 1 0 _ 0 0 0 0    5      0      5
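The effect of the two representations on a shift-and-add multiplier can be illustrated with the cos(π/4) entry. The sketch assumes the 14-bit word is a sign bit followed by 13 fractional bits, so the coefficient is the integer 5792 scaled by 2⁻¹³; the term lists and the names `BINARY_TERMS`, `CSD_TERMS` and `shift_add_multiply` are hypothetical and were derived by hand from that value.

```python
import math

COEFF = int(math.cos(math.pi / 4) * (1 << 13))    # truncated to 13 fractional bits

# (shift, sign) pairs: the constant is the sum of sign * 2**shift over all terms.
BINARY_TERMS = [(12, 1), (10, 1), (9, 1), (7, 1), (5, 1)]       # five +1's
CSD_TERMS = [(13, 1), (11, -1), (9, -1), (7, 1), (5, 1)]        # three +1's, two -1's


def shift_add_multiply(x, terms):
    """Multiply integer x by a constant using only shifts, additions and subtractions."""
    acc = 0
    for shift, sign in terms:
        acc += sign * (x << shift)
    return acc
```

Both term lists have five non-zero digits, so CSD does not reduce the adder count for this coefficient and only turns two additions into subtractions, which matches the observation that the modified CSD form of cos(π/4) is the same as the binary form.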

In the H.264 standard, the binary forms of the coefficients twelve, six and three each include two +1’s. In CSD, each of them contains one +1 and one −1. In modified CSD, the number of +1’s and −1’s of twelve, six and three is the same as in binary (two’s complement). The details are listed in Table 3.4.

Table 3.4 The Binary, CSD, and Modified CSD Values of H.264 Standard

Binary (Two's complement)

Coefficient   Value (14 bits)                       +1's   -1's   Total 1's
12            0 0 _ 0 0 0 0 _ 0 0 0 0 _ 1 1 0 0     2      0      2
6             0 0 _ 0 0 0 0 _ 0 0 0 0 _ 0 1 1 0     2      0      2
3             0 0 _ 0 0 0 0 _ 0 0 0 0 _ 0 0 1 1     2      0      2

CSD

Coefficient   Value                                 +1's   -1's   Total 1's
12            0 0 _ 0 0 0 0 _ 0 0 0 1 _ 0 -1 0 0    1      1      2
6             0 0 _ 0 0 0 0 _ 0 0 0 0 _ 1 0 -1 0    1      1      2
3             0 0 _ 0 0 0 0 _ 0 0 0 0 _ 0 1 0 -1    1      1      2

Coefficient

Value Value +1's -1's Total 1's
