General splitting and merging of 2-D DCT in the DCT domain


The 47th IEEE International Midwest Symposium on Circuits and Systems



Yuh-Jue Chuang, Ting-Jian Pan
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan, R.O.C.

Ja-Ling Wu, Senior Member, IEEE
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan, R.O.C.

Abstract-An efficient method for splitting an NxN 2-D DCT block into four (N/2)x(N/2) or two Nx(N/2) (or (N/2)xN) 2-D DCT blocks, and vice versa, is presented. The computational complexity of the proposed methods is lower than that of the direct approach and the same as that of the most efficient conversion approach in the literature. Besides, the proposed DCT splitter/merger is suitable for implementation with the multimedia instruction sets available nowadays. When N = 8, our method can be applied to realize the transcoding between the latest video coding standard, AVC/H.264, and the older ones, such as MPEG-1, MPEG-2 and MPEG-4 part 2.

I. INTRODUCTION

Multimedia signals, such as images and video, are always compressed so as to save memory space and/or to be easily transmitted over a network. However, the signals have to be processed before being displayed, transmitted, printed, etc. Some of the frequently used manipulations are scaling, filtering, rotation and translation. Implementing these functions in the compressed domain is advantageous in terms of computational complexity as well as image quality and memory usage, because the transition to the time or spatial domain and the recompression of the data are avoided. All existing compression standards (JPEG, MPEG-1, MPEG-2 and H.26x) are based on the discrete cosine transform (DCT), which is applied to blocks of data of certain lengths. Our work aims to directly split an NxN 2-D DCT into four adjacent (N/2)x(N/2) 2-D DCTs without the need of performing an NxN IDCT and four (N/2)x(N/2) DCTs. Conversely, merging four adjacent (N/2)x(N/2) 2-D DCTs into one NxN DCT can be realized by inverting the "split" case. The block diagram of the proposed 1-D DCT splitter/merger is shown in Fig. 1(b), where the DCT splitter/merger module has to be implemented using as few multiplications, additions and permutations as possible. We will show the superiority of the proposed algorithm in computational complexity at both the algorithmic level and the programming level.

This paper is organized as follows. In Section II, the splitter/merger of the 1-D/2-D DCT is addressed. Two kinds of computational complexity comparisons with other approaches are given in Section III. Finally, Section IV concludes this write-up.

II. THE DCT SPLITTER AND/OR MERGER

For the sake of simplicity, let us first confine our attention to the 1-D case. Assume that the DCT coefficients $Y_k$ and $Z_k$, $k = 0, \ldots, N/2-1$, of two consecutive N/2-point data sequences $\mathbf{y} = [y_0\; y_1\; \cdots\; y_{N/2-1}]^t$ and $\mathbf{z} = [z_0\; z_1\; \cdots\; z_{N/2-1}]^t$ are given. The problem to be addressed is: how to efficiently compute the N-point DCT coefficients $X_k$, $k = 0, 1, \ldots, N-1$, where $X_k$ stands for the DCT coefficients of $\mathbf{x} = [x_0\; x_1\; \cdots\; x_{N-1}]^t = [y_0\; y_1\; \cdots\; y_{N/2-1}\; z_0\; z_1\; \cdots\; z_{N/2-1}]^t$.

The normalized forward DCT Type-II (DCT-II) of a length-N sequence $x_n$ is given in [1] as follows:

$$X_k = \varepsilon_k \sum_{n=0}^{N-1} x_n \cos\left(\frac{\pi(2n+1)k}{2N}\right), \quad k = 0, 1, \ldots, N-1, \quad (1)$$

where $\varepsilon_k = \sqrt{1/N}$ for $k = 0$ and $\varepsilon_k = \sqrt{2/N}$ for $k \neq 0$.

The DCT-II's of the length-N/2 sequences y and z can be represented by similar expressions.

Fig. 1. (a) The traditional approach, and (b) the proposed approach for realizing the 1-D DCT splitter/merger.

The DCT defined in (1) can be written in matrix form as

$$\mathbf{X} = [T_N^{\varepsilon}]\,\mathbf{x} = [C_N][T_N]\,\mathbf{x} \quad (2)$$

where $[C_N] = \mathrm{diag}(\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_{N-1})$ and $[T_N]$ is an NxN matrix with elements $[T_N]_{ij} = \cos\left(\frac{\pi(2j+1)i}{2N}\right)$, $i, j = 0, 1, \ldots, N-1$. In transform coding applications, the DCT stage is usually followed by a quantization stage. The computation of $\varepsilon_k$ can be saved by absorbing the scale factors into the quantization stage (the so-called non-scaled case). In this case, we rewrite Eqn. (2) as

$$\mathbf{X} = [T_N]\,\mathbf{x} \quad (3)$$
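The relation between the scaled form (2) and the non-scaled form (3) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the scale factors assume the usual orthonormal choice ($\sqrt{1/N}$ for $k = 0$, $\sqrt{2/N}$ otherwise); under that assumption the product $[C_N][T_N]$ has orthonormal rows, which `gram` checks numerically.

```python
import math

def dct_matrix(n):
    """Unscaled kernel [T_N]: entry (k, j) is cos(pi * (2j + 1) * k / (2N))."""
    return [[math.cos(math.pi * (2 * j + 1) * k / (2 * n)) for j in range(n)]
            for k in range(n)]

def scale_factors(n):
    """Diagonal of [C_N]; orthonormal scaling is an assumption of this sketch."""
    return [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)

def dct_unscaled(x):
    """Eqn. (3): X = [T_N] x, with the scale factors left to the quantizer."""
    n = len(x)
    t = dct_matrix(n)
    return [sum(t[k][j] * x[j] for j in range(n)) for k in range(n)]

def dct_scaled(x):
    """Eqn. (2): X = [C_N][T_N] x."""
    return [e * v for e, v in zip(scale_factors(len(x)), dct_unscaled(x))]

def gram(n):
    """([C_N][T_N]) ([C_N][T_N])^t, expected to be the identity matrix."""
    eps, t = scale_factors(n), dct_matrix(n)
    a = [[eps[k] * t[k][j] for j in range(n)] for k in range(n)]
    return [[sum(a[r][j] * a[s][j] for j in range(n)) for s in range(n)]
            for r in range(n)]

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
    print("non-scaled:", dct_unscaled(x))
    print("scaled:    ", dct_scaled(x))
```

Because the two forms differ only by a fixed diagonal matrix, dropping $[C_N]$ changes no structure in the transform; it only moves N multiplications into the quantizer tables.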

A. Use recursive form $[\hat{T}_N]$ instead of $[T_N]$:

Instead of viewing (3) as the real part of a scaled N-point DFT, Hou [2] exploited the symmetry of the kernel in (3) and partitioned the transform matrix $[T_N]$ into a recursive form. Let

$$[\hat{T}_N] = [P_N][T_N][\bar{P}_N]^{-1},$$

it can be verified that

(4)

where $[P_N]$ is an (N x N) bit-reversal matrix and $[\bar{P}_N]$ is also an (N x N) permutation matrix defined as

1 0 0 0 0 . . . 0 0

0 0 0 0 1 ~. . o o

I

I . . . .

* This work was partially supported by the National Science Council and the Ministry of Education of ROC under the contract No. NSC92-2622-E-002-002, NSC92-2213-E-002-023 and 89E-FA06-2-4-8.

0-7803-8346-X/04/$20.00 © 2004 IEEE


The transform matrix $[\hat{T}_N]$ can be delineated by considering the block matrix factorization as:

B. Efficient DCT domain split and merge algorithms

B.1 One dimensional case

Define $[U_N]$ as another permutation matrix; it can be verified that $[U_N]^{-1} = [U_N]^{t}$ and

(8) [ I , ,I 0 0 0

Substituting (7) and (8) into (6) leads to

Consequently,

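Then, it is clear that

$$[\hat{T}_{N/2}][\bar{P}_{N/2}]\,\mathbf{y} = \left([P_{N/2}]\mathbf{X}_e + [M_{N/2}]^{-1}[K_{N/2}]^{-1}[P_{N/2}]\mathbf{X}_o\right)/2 \quad (11)$$

and

$$[\hat{T}_{N/2}][J_{N/2}][\bar{P}_{N/2}]\,\mathbf{z} = \left([P_{N/2}]\mathbf{X}_e - [M_{N/2}]^{-1}[K_{N/2}]^{-1}[P_{N/2}]\mathbf{X}_o\right)/2, \quad (12)$$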

where $\mathbf{X}_e$ and $\mathbf{X}_o$ are the even- and odd-numbered transformed components of $\mathbf{X}$, and it can easily be verified that

$$[P_N]\,\mathbf{X} = \begin{bmatrix} [P_{N/2}]\,\mathbf{X}_e \\ [P_{N/2}]\,\mathbf{X}_o \end{bmatrix}.$$

Notice that $[\hat{T}_{N/2}][\bar{P}_{N/2}]\,\mathbf{y} = [P_{N/2}]\,\mathbf{Y}$, as shown in (11), is a reordered version of $\mathbf{Y}$, and if we set

Z=[J,,JT,

21[i\ ]I[F, ,I., then one can show that

- [ I \

" I

2 1 .

[ J , ] =

[P,

][l'b'

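Hence, the splitter algorithm can be described in matrix form as $\begin{bmatrix} \mathbf{Y} \\ \mathbf{Z} \end{bmatrix} = [S_N]\,\mathbf{X}$ and the merger algorithm as $\mathbf{X} = [S_N]^{-1} \begin{bmatrix} \mathbf{Y} \\ \mathbf{Z} \end{bmatrix}$.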
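What a 1-D splitter/merger has to compute can be sketched in Python as a dense reference matrix. This is not the fast factorization derived above: it is simply the product of an N-point inverse DCT and two N/2-point forward DCTs (the traditional route of Fig. 1(a)) collapsed into one matrix, in the non-scaled convention of (3). All function names are hypothetical.

```python
import math

def dct_matrix(n):
    """Unscaled [T_N]: entry (k, j) is cos(pi * (2j + 1) * k / (2N))."""
    return [[math.cos(math.pi * (2 * j + 1) * k / (2 * n)) for j in range(n)]
            for k in range(n)]

def idct_matrix(n):
    """Inverse of the unscaled [T_N]: entry (j, k) is w_k * [T_N](k, j),
    with w_0 = 1/N and w_k = 2/N otherwise."""
    t = dct_matrix(n)
    return [[(1.0 if k == 0 else 2.0) / n * t[k][j] for k in range(n)]
            for j in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    return [sum(p * q for p, q in zip(row, v)) for row in a]

def block_diag(m):
    """diag(m, m) for a square matrix m."""
    h = len(m)
    out = [[0.0] * (2 * h) for _ in range(2 * h)]
    for i in range(h):
        for j in range(h):
            out[i][j] = m[i][j]
            out[h + i][h + j] = m[i][j]
    return out

def splitter_matrix(n):
    """Dense reference splitter: diag([T_{N/2}], [T_{N/2}]) [T_N]^{-1}."""
    return matmul(block_diag(dct_matrix(n // 2)), idct_matrix(n))

def merger_matrix(n):
    """Dense reference merger: [T_N] diag([T_{N/2}]^{-1}, [T_{N/2}]^{-1})."""
    return matmul(dct_matrix(n), block_diag(idct_matrix(n // 2)))

if __name__ == "__main__":
    x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
    big = matvec(dct_matrix(8), x)            # X, the 8-point DCT
    yz = matvec(splitter_matrix(8), big)      # [Y; Z] obtained from X alone
    print("Y:", yz[:4])
    print("Z:", yz[4:])
    print("X again:", matvec(merger_matrix(8), yz))
```

In this convention the even-numbered outputs satisfy $X_{2k} = Y_k + (-1)^k Z_k$, so half of the dense matrix is trivial; the work of a fast algorithm lies in the odd-numbered part.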

B.2 Two dimensional case

Let $[G]$ be the N x N two-dimensional (2-D) DCT-II of the N x N data matrix $[g]$; their relationship can be represented in matrix form as

$$[G] = [T_N][g][T_N]^{-1} \quad \text{and} \quad [g] = [T_N]^{-1}[G][T_N]. \quad (15)$$

Eqn. (15) implies that the 2-D DCT can be implemented by performing a series of 1-D transforms. The data matrix $[g]$ can be decomposed into four (N/2) x (N/2) data blocks, $[g_1]$, $[g_2]$, $[g_3]$ and $[g_4]$. Our goal is to split $[G]$ into four (N/2) x (N/2) matrices $[G_1]$, $[G_2]$, $[G_3]$ and $[G_4]$, which respectively are the DCTs of $[g_1]$, $[g_2]$, $[g_3]$ and $[g_4]$, without the need of using an N x N IDCT and four (N/2) x (N/2) DCTs. Following the same derivation given in subsection II.A, the split algorithm can be written as

and the merge algorithm as

We can also split $[G]$ into two N x (N/2) or (N/2) x N DCT blocks, respectively, as

and

Notice that these two kinds of 2-D DCTs had also been included in early versions of AVC/H.264 [3] for N = 8.
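Since the 2-D transform is separable, a 1-D splitter can be applied along rows and columns. The Python sketch below shows this with a dense reference splitter built from orthonormal DCT matrices, so that the inverse equals the transpose. It is an illustration under assumptions, not the paper's fast factorization: the function names are hypothetical, and the quadrant order (top-left, top-right, bottom-left, bottom-right) is assumed.

```python
import math

def dct_ortho(n):
    """Orthonormal DCT-II matrix (scale factors included): inverse == transpose."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
             for j in range(n)] for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dct2(block):
    """Separable 2-D DCT of a (possibly non-square) data block."""
    rows, cols = dct_ortho(len(block)), dct_ortho(len(block[0]))
    return matmul(matmul(rows, block), transpose(cols))

def splitter(n):
    """Dense 1-D reference splitter: diag(A_{N/2}, A_{N/2}) A_N^t."""
    h = n // 2
    a_h = dct_ortho(h)
    d = [[0.0] * n for _ in range(n)]
    for i in range(h):
        for j in range(h):
            d[i][j] = d[h + i][h + j] = a_h[i][j]
    return matmul(d, transpose(dct_ortho(n)))

def split_four(g_dct):
    """Split an N x N DCT block into four (N/2) x (N/2) DCT blocks."""
    n = len(g_dct)
    h = n // 2
    s = splitter(n)
    b = matmul(matmul(s, g_dct), transpose(s))   # rows and columns
    quad = lambda r, c: [row[c:c + h] for row in b[r:r + h]]
    return quad(0, 0), quad(0, h), quad(h, 0), quad(h, h)

def split_left_right(g_dct):
    """Split an N x N DCT block into two N x (N/2) DCT blocks (columns only)."""
    n = len(g_dct)
    h = n // 2
    b = matmul(g_dct, transpose(splitter(n)))
    return [row[:h] for row in b], [row[h:] for row in b]

if __name__ == "__main__":
    g = [[float((3 * i + 5 * j * j + i * j) % 17) for j in range(8)]
         for i in range(8)]
    g1, g2, g3, g4 = split_four(dct2(g))
    print(g1[0])
```

Applying the splitter on one side only yields the rectangular N x (N/2) blocks; applying it on both sides yields the four square blocks.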

C. A concrete example: the (8x8) to (4x4) DCT-II splitter

For a long time, the 8x8 DCT has been adopted extensively by the major coding standards for representing digital images, delivering broadcast video and enabling personal visual communications. Nowadays, the latest AVC/H.264 [4] video coding standard provides the capability to predict samples locally down to 4x4 blocks before transform coding. As a result, an impressive compression gain has been obtained, about 1.5 to 2.0 times that of MPEG-4 part 2, while nearly the same perceptual picture quality is kept. Due to the wide spread of DVD and the trend toward HDTV/DTV, most existing video contents are still encoded in MPEG-2 form. Therefore, the MPEG-2-to-AVC/H.264 transcoder, or vice versa, is believed to be an indispensable functional module of future video signal treatments.

The simplest way to develop an MPEG-2-to-AVC/H.264 transcoder is to directly cascade an 8x8 IDCT with four 4x4 DCTs. Intuitively, this direct approach is computationally intensive and suffers from a certain degree of quality loss. On the other hand, by using the DCT splitter and/or merger discussed in the last section, a more efficient and precise video transcoder can be implemented directly in the DCT transform domain. According to Eqns. (16) and (17), an 8x8 DCT-II block $[G]$ can be split into four 4x4 DCT-II blocks $[G_1]$, $[G_2]$, $[G_3]$ and $[G_4]$ and vice versa as

where

The signal-flow graph for realizing the proposed transform domain splitting and merging between one 8-point DCT and two 4-point DCTs is depicted in Fig. 2.

The number of operations needed for splitting an NxN 2-D DCT block into four (N/2)x(N/2) 2-D DCT blocks in different approaches:

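$$[L_{N/2}]^{-1} = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 & \cdots & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$$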

B. The programming-level analysis

Another way to increase the performance of a signal processing task is to execute several computations in parallel. Even without algorithmic-level optimization, the direct approach can be sped up by using today's multimedia instruction set technology. In the following, we will show that our splitter/merger is much more suitable for realization on processors with SIMD instructions than other approaches.

B.1. SIMD instructions

Intel Pentium 4 processors extended the SIMD computation model with the introduction of SSE2, which operates on packed double-precision floating-point data elements as well as 128-bit packed integers. The full set of SIMD capabilities greatly improves the performance of multimedia-related applications. For example, a 128-bit SSE2 register can be used to store up to 16 units of 8-bit integers, and up to 16 arithmetic operations can be executed simultaneously by using two SSE2 registers. This results in a significant performance improvement, especially for video encoding.

In our splitter/merger, matrix multiplication is the key operation. The SSE2 instructions PMADDWD and PADDD can facilitate its implementation. Fig. 3 shows an example of computing the first row of the product of two 4x4 matrices. Matrices X and Y are loaded into SSE registers in a particular form. After that, PMADDWD is performed on these registers in pairs: eight multiplications are performed on pairs of 16-bit integers simultaneously, and the products are added together in pairs and stored in one SSE register as 32-bit integers. Then the two resulting 128-bit data are added together in 32-bit integer precision by PADDD; hence the first row of the resulting matrix is obtained right away. Similar steps are used for the next three rows. Therefore, 8 PMADDWDs and 4 PADDDs are required for realizing a 4x4 matrix multiplication. The time spent in this approach is much less than that of the $4^3$ multiplications and $3 \times 4^2$ additions required in the direct calculation.
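The register layout described above can be emulated in plain Python to check the instruction count. This is a behavioural sketch only: `pmaddwd` and `paddd` are hypothetical helper functions that mimic the lane arithmetic of the corresponding SSE2 instructions on Python lists, and the interleaving of Y is one layout consistent with the description, assumed here for illustration.

```python
def _wrap32(v):
    """Wrap an integer to the signed 32-bit range, as a 32-bit lane would."""
    return (v + 2**31) % 2**32 - 2**31

def pmaddwd(a, b):
    """Emulate PMADDWD: multiply eight signed 16-bit lanes pairwise and add
    adjacent products, giving four signed 32-bit lanes."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("PMADDWD operates on eight 16-bit lanes")
    if any(not -32768 <= v <= 32767 for v in list(a) + list(b)):
        raise ValueError("lane value does not fit in a signed 16-bit integer")
    return [_wrap32(a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1])
            for i in range(4)]

def paddd(a, b):
    """Emulate PADDD: add four signed 32-bit lanes with wrap-around."""
    return [_wrap32(p + q) for p, q in zip(a, b)]

def matmul4_simd(x, y):
    """4x4 integer matrix product using 2 PMADDWDs and 1 PADDD per output row."""
    # Interleave rows 0/1 and rows 2/3 of Y column by column (assumed layout).
    y01 = [y[r][c] for c in range(4) for r in (0, 1)]
    y23 = [y[r][c] for c in range(4) for r in (2, 3)]
    counts = {"PMADDWD": 0, "PADDD": 0}
    out = []
    for row in x:
        lo = pmaddwd([row[0], row[1]] * 4, y01)   # x_i0*y_0j + x_i1*y_1j
        hi = pmaddwd([row[2], row[3]] * 4, y23)   # x_i2*y_2j + x_i3*y_3j
        out.append(paddd(lo, hi))
        counts["PMADDWD"] += 2
        counts["PADDD"] += 1
    return out, counts

if __name__ == "__main__":
    x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    y = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
    product, counts = matmul4_simd(x, y)
    print(product)
    print(counts)
```

The emulation issues two PMADDWDs and one PADDD per output row, which gives the 8 PMADDWDs and 4 PADDDs quoted for a full 4x4 product.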

Fig. 3. (a) A 4x4 matrix-by-matrix multiplication. (b) An example of efficient implementation of 4x4 matrix-by-matrix multiplication.

B.2. Performance Comparisons

Both the proposed splitter/merger and the direct approach can be sped up by SSE2 instructions because both of them are matrix-based algorithms. Let us investigate in detail the most important case of splitting an 8-point DCT into two 4-point DCTs. In the following analyses, the scale factors are absorbed into the quantization stage.

1) The proposed splitter/merger: We need to count the number of operations required for $[S_8]$. In (21), $[E]$ and $[P_4]$ can be implemented by shifts and permutations; two PADDDs are required for realizing an 8x1 vector multiplied by $\begin{bmatrix} [I_4] & [I_4] \\ [I_4] & -[I_4] \end{bmatrix}$. 8 PMADDWDs and 4 PADDDs are required for computing the product of $[\hat{T}_4]$ and $[\hat{T}_4]^{-1}$ with a 4x1 vector, respectively. Since $[Q_4]$ is a diagonal matrix, the matrix-vector product by $[Q_4]$ can be done by one PMADDWD. Since

$$[L_4]^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix},$$

the matrix-vector product by $[L_4]^{-1}$ can be done by one PADDD and bit shifts. In total, the proposed splitter/merger requires 17 PMADDWDs and 11 PADDDs.

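2) Direct approach: An 8-point IDCT is done first and then two forward 4-point DCTs are taken, which can be written in matrix form as

$$\begin{bmatrix} \mathbf{Y} \\ \mathbf{Z} \end{bmatrix} = \begin{bmatrix} [T_4] & 0 \\ 0 & [T_4] \end{bmatrix} [T_8]^{-1}\,\mathbf{X} = \begin{bmatrix} [T_4][A_1] & [T_4][A_3] \\ [T_4][A_2] & [T_4][A_4] \end{bmatrix} \mathbf{X}, \quad (23)$$

where $[A_1]$, $[A_2]$, $[A_3]$ and $[A_4]$ are the left-up, left-down, right-up and right-down 4x4 matrices of $[T_8]^{-1}$, respectively. To accomplish Eqn. (23), four 4x4 matrix-by-matrix multiplications and four 4x4 matrix-by-vector multiplications are needed. Thus, 40 PMADDWDs and 20 PADDDs are needed.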
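The two instruction totals can be re-derived with a short Python tally of the per-step counts quoted in the text. The per-product cost of a 4x4 matrix-by-vector multiplication (2 PMADDWDs and 1 PADDD, i.e. one output row's worth of the Fig. 3 scheme) is not stated explicitly; it is an assumption of this sketch, chosen because it reproduces the stated total for the direct approach.

```python
# Each entry is (PMADDWD count, PADDD count).
MATMAT_4X4 = (8, 4)   # 4x4 matrix-by-matrix product, as stated in B.1
MATVEC_4X4 = (2, 1)   # assumed cost of one 4x4 matrix-by-vector product

def total(steps):
    """Sum a list of (PMADDWD, PADDD) pairs component-wise."""
    return tuple(sum(column) for column in zip(*steps))

# Proposed splitter/merger, step by step as listed in item 1).
proposed = total([
    (0, 2),   # sum/difference stage on the 8x1 vector
    (8, 4),   # product with the recursive DCT matrix
    (8, 4),   # product with its inverse
    (1, 0),   # diagonal matrix [Q_4]
    (0, 1),   # [L_4]^{-1} (plus bit shifts, not counted)
])

# Direct approach, as listed in item 2).
direct = total([MATMAT_4X4] * 4 + [MATVEC_4X4] * 4)

if __name__ == "__main__":
    print("proposed (PMADDWD, PADDD):", proposed)
    print("direct   (PMADDWD, PADDD):", direct)
```

Under these counts the proposed scheme needs fewer than half of the PMADDWDs of the direct approach; shifts and permutations are left out of both tallies.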

3) A. N. Skodras' approach: In this approach, one cannot easily load the input data into the longer registers; thus this approach cannot gain any benefit from SSE2 instructions.

The direct approach can be optimized by using SIMD technology; however, without algorithmic-level optimization, the speed-up is limited. By integrating an effective algorithmic-level derivation with programming-level optimizations based on SIMD instructions, our approach achieves better performance than the other known algorithms.

IV. CONCLUSIONS

An efficient method for directly splitting the coefficients of an NxN 2-D DCT into those of four adjacent (N/2)x(N/2) or two adjacent Nx(N/2) (or (N/2)xN) 2-D DCTs has been proposed. The algorithmic-level computational complexity of the proposed algorithm is lower than that of the direct approach, and the programming-level analysis shows that the proposed approach achieves the best performance with the aid of available multimedia instruction set technology. Due to its efficiency, the proposed DCT splitter/merger can be applied to realize the transcoding between the latest video coding standard, AVC/H.264, and the older ones, such as MPEG-1, MPEG-2 and MPEG-4 part 2.

REFERENCES

[1] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages and Applications. New York: Academic, 1990.

[2] Hsieh S. Hou, "A Fast Recursive Algorithm For Computing the Discrete Cosine Transform," IEEE Trans. on ASSP, Oct. 1987.

[3] ITU-T Rec. H.264 / ISO/IEC FDIS 14496-10, "Advanced Video Coding," Final Draft International Standard, March 2003.

[4] ITU-T Rec. H.264 / ISO/IEC 14496-10, "Advanced Video Coding," Final Committee Draft, Document JVT-E022, September 2002.

[5] X. Zhou, E. Q. Li, and Y.-K. Chen, "Implementation of H.264 Decoder on General-Purpose Processors with Media Instructions," SPIE Conf. on Image and Video Communications and Processing, Jan. 2003.

[6] Y.-K. Chen, N. Yu, and B. Shah, "Digital Signal Processing on MMX Technology," in Programmable Digital Signal Processors: Architecture, Programming and Design, Y. H. Hu, Ed. (Marcel Dekker: NY), pp. 295-331, 2002.

[7] A. N. Skodras, "Fast discrete cosine transform pruning," IEEE Trans. Signal Processing, vol. 42, pp. 1833-1837, July 1994.

[8] C. W. Kok, "Fast algorithm for computing discrete cosine transform," IEEE Trans. Signal Processing, vol. 45, pp. 757-760, Mar. 1997.

[9] Athanassios N. Skodras, "Direct Transform to Transform Computation," IEEE Signal Processing Letters, vol. 6, no. 8, Aug. 1999.