A Refined Fast 2-D Discrete Cosine Transform Algorithm

(1)

904 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999

A Refined Fast 2-D Discrete Cosine Transform Algorithm Yuh-Ming Huang and Ja-Ling Wu

Abstract—In this correspondence, an index permutation-based fast two-dimensional discrete cosine transform (2-D DCT) algorithm is presented. It is shown that theN 2 NN 2 NN 2 N 2-D DCT, where N = 2N = 2N = 2mmm, can be computed using onlyNNN 1-D DCT’s and some post additions.

Index Terms— Discrete cosine transform, fast algorithms, index per-mutation.

I. INTRODUCTION

The DCT is widely used in many digital signal processing ap-plications. Fast algorithms have been reported in the literature (i.e., [1]–[6]). Among those algorithms, [5] and [6] are believed to be the most efficient 2-D DCT algorithms in the sense of minimizing any measure of computational complexity.

Recently, Cho and Lee [7] have proposed a fast and modular DCT algorithm in which an (N 2 N)-point 2-D DCT could be obtained by computingN 2 N-point 1-D DCT’s and log₂N + 1 butterfly stages. In a later work [8], they also provide regular expressions for the input–output relations of the post-addition stages. However, the number of required additions increases at the expense of improving the regularity in the structure.

Based on the idea of [7], in this correspondence, an index permutation-based fast 2-D DCT algorithm is proposed.

II. THEREFINEDFASTALGORITHM FORCOMPUTING THE2-D DCT For a given input data sequencefi; j; 0 i N 0 1; 0 j N 0 1, the denormalized 2-D DCT can be expressed as [1]

Ym; n= N01

i=0 N01

j=0

fi; j cos (2i + 1)m_2N cos (2j + 1)n_2N 0 m N 0 1; 0 n N 0 1: (1) After some permutation of the input data sequence [2], (1) can be written as Ym; n= N01 i=0 N01 j=0

Xi; jcos (4i + 3)m_2N cos (4j + 3)n_2N 0 m N 0 1; 0 n N 0 1 (2.1)

Manuscript received July 8, 1996; revised July 31, 1998. The associate editor coordinating the review of this paper and approving it for publication was Dr. Xiang-Gen Xia.

Y.-M. Huang is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C., and the Department of Information Engineering, National Chi-Nan University, Puli, Taiwan, R.O.C.

J.-L. Wu is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

Publisher Item Identifier S 1053-587X(99)01358-6.

where Xi; j= f2i+1; 2j+1 0 i; j N=2 0 1 f2i+1; 2N02j02 0 i N=2 0 1; N=2 j N 0 1 f2N02i02; 2j+1 N=2 i N 0 1; 0 j N=2 0 1 f2N02i02; 2N02j02 N=2 i; j N 0 1: (2.2)

Based on the idea of [7], let

Am; n= N01 i=0 N01 j=0 Xi; jcos (4i + 3)m + (4j + 3)n_2N 0 m N 0 1; 0 n N 0 1 (3.1) and Bm; n= N01 i=0 N01 j=0 Xi; jcos (4i + 3)m 0 (4j + 3)n_2N 0 m N 0 1; 0 n N 0 1: (3.2) Then Ym; n= (Am; n+ Bm; n)=2: (4) Since4j + 3 and N are coprime to each other, i.e., (4j + 3; N)1

= 1, the permutation (4j + 3)i + j modulo N maps all values of i. Letqij be the quotient of(4j + 3)i + j divided by N. Hence, the kernels of the 2-D transforms in (3.1) and (3.2) can be rewritten as 1-D DCT’s by replacingi with (4j + 3)i + j 0 Nqij. That is

Am; n= N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m + n) 0 4Nq_2N ijm =N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m + n)_2N 0 m N 0 1; 0 n N 0 1 (5.1) and Bm; n= N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m 0 n) 0 4Nq_2N ijm = N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m 0 n)_2N 0 m N 0 1; 0 n N 0 1 (5.2) where hxiN denotes x modulo N.

For the simplicity of notation,X_{h(4j+3)i+ji ; j}is denoted asX_{i; j}. Then, it can be seen that the 2-D input data sequence is grouped into

1_{(a; b) denotes the gcd of a and b.}

1053–587X/99$10.001999 IEEE

(2)

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999 905 N distinct data sets of size N, that is, f(Xi; j; 0 j N 01); 0

i N 0 1g, and the equations N01 j=0 Xi; j cos (4j + 3)((4i + 1)m + n)_2N (6.1) and N01 j=0 Xi; j cos (4j + 3)((4i + 1)m 0 n)_2N (6.2) correspond to one of the 1-D DCT’s of the data sequence ^X_{i; j} or equal to zero with respect tom and n. That is, by defining

hil= N01

j=0

Xi; j cos (4j + 3)l_2N (7) we can see that (6.1) and (6.2) correspond to one of+h_il or0h_il for somel = 0; 1; 1 1 1 ; N 0 1 or equal to zero. Besides, through an index permutation, (7) can be implemented by a 1-D DCT as

hil= N01 j=0 ^ Xi; jcos (2j + 1)l_2N (8.1) where ^ Xi; j= X_Xi; (j01)=2 j: odd i; N010j=2 j: even. (8.2) Hence, for the computation of an(N 2 N)-point 2-D DCT, only the computation ofN 2N-point 1-D DCT’s and some post-additions are required.

Next, we will show that the post-addition stages can be imple-mented by a butterfly-like structure. Since

cos (4j + 3)((4(i + N=2) + 1)m + n)_2N = 6 cos (4j + 3)((4i + 1)m + n)_2N (9.1) and cos (4j + 3)((4(i + N=2) + 1)m 0 n) 2N = 6 cos (4j + 3)((4i + 1)m 0 n) 2N : (9.2)

Hence,Ym; ncan be expressed as in (10), shown at the bottom of the page. LetX_{i; j}+X_{(i+N=2); j}= R_{i; j},X_{i; j}0X_{(i+N=2); j}= S_{i; j}.

For m even, since

cos (4j + 3)((4(i + N=4) + 1)m + n)_2N

= 6 cos (4j + 3)((4i + 1)m + n)_2N (11.1)

Fig. 1. Signal flow graph for 42 4 DCT.

and

cos (4j + 3)((4(i + N=4) + 1)m 0 n)_2N

= 6 cos (4j + 3)((4i + 1)m 0 n)_2N (11.2) Ym; n can be written as in (12), shown at the bottom of the page, where k = 0; 1; 1 1 1 ; N=4 0 1.

For example, ifN = 4, by (12), we have Y2; n= 1₂ 3 j=0 (R0; j0 R1; j) cos (4j + 3)(2 + n)_2N + cos (4j + 3)(2 0 n)_2N : (13) Ym; n= 1 2 N=201 i=0 N01 j=0

(Xi; j+ X(i+N=2); j) cos (4j + 3)((4i + 1)m + n)_2N + cos (4j + 3)((4i + 1)m 0 n)_2N ; for m even 1 2 N=201 i=0 N01 j=0

(Xi; j0 X(i+N=2); j) cos (4j + 3)((4i + 1)m + n)_2N + cos (4j + 3)((4i + 1)m 0 n)_2N ; for m odd (10) Ym; n= 1 2 N=401 i=0 N01 j=0

(Ri; j+ R(i+N=4); j) cos (4j+3)((4i+1)m+n)_2N +cos (4j+3)((4i+1)m0n)_2N ; for m = 2(2k) 1 2 N=401 i=0 N01 j=0

(Ri; j0 R(i+N=4); j) cos (4j+3)((4i+1)m+n)_2N +cos (4j+3)((4i+1)m 0 n)_2N ; for m = 2(2k+1) (12)

(3)

906 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999

(a)

Fig. 2. Signal flow graph for 82 8 DCT. (a) First stage of the post-addition stages.

Form odd, let kil= N01 j=0 Si; j cos (4j + 3)l_2N : (14) Since cos (4j + 3)((4(i + N=4) + 1)m + n)_2N = 6 sin (4j + 3)((4i + 1)m + n)_2N (15.1) and cos (4j + 3)((4(i + N=4) + 1)m 0 n)_2N = 6 sin (4j + 3)((4i + 1)m 0 n) 2N : (15.2) However sin (4j + 3)l_2N = 0 cos (4j + 3)(N 0 l)_2N (16) that is, the 1-D discrete sine transform can be directly computed from the 1-D discrete cosine transform. Therefore, for somer and

(b)

Fig. 2. (Continued). Signal flow graph for 82 8 DCT. (b) Second, third, and fourth stages of the post-addition stages (forn is even).

s, 0 r; s N; Ym; ncan be written as shown in (17), shown at the bottom of the page, wherek = 0; 1; 1 1 1 ; N=4 0 1, and Girand His are, respectively, equal to6kir and 6kis.

For example, ifN = 4; r = 3, and s = 1, by (17), we have Y3; 2=1₂f(0k03+ k11) + (k01+ k13)g:

As a result, the computation of an(N 2 N)-point 2-D DCT can be achieved by recursively applying the above decompositions [(12) and (17)]. The signal flow graphs for a 4 2 4 and an 8 2 8 DCT are shown in Figs. 1 and 2, respectively.

Ym; n= 1 2

N=401 i=0

(Gir0 G(i+N=4)(N0r)) + (His+ H(i+N=4)(N0s)) form = 2(2k) + 1 1

2 N=401

i=0

(Gir+ G(i+N=4)(N0r)) + (His0 H(i+N=4)(N0s)) form = 2(2k + 1) + 1

(17)

(4)

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999 907

(c)

Fig. 2. (Continued). Signal flow graph for 82 8 DCT. (c) Second, third, and fourth stages of the post-addition stages (forn is odd).

III. COMPLEXITYANALYSIS OF THEPOST-ADDITIONSTAGES For the post-addition stages, let A(N) and B(N), respectively, denote the number of all required additions and the number of additions required in the final stage, and letC(N) denote the number of nodes that do not require butterfly computations in the firstlog₂N stages. From (13) and (16), we have C(4) = 2 and C(N) = C(N=2) + N=2 for N 8, and B(N) = N2_{0 2N. Therefore,} A(N) = N2_log

2N 0C(N)+B(N) = N2(1+log2N)03N +2.

IV. CONCLUSIONS

An index-permutation based 2-D DCT algorithm has been pre-sented in this correspondence. The succinct derivation of the proposed algorithm makes it easier to describe the process of how to map one 2-D DCT into a number of 1-D DCT’s. From the idea of [8], a matrix-form-based systematic expression for the post-addition stages

in the proposed algorithm, which may improve the regularity of the structure, is currently under investigation.

REFERENCES

[1] R. K. Rao and P. Yip, Discrete Cosine Transform: Algorithm, Advan-tages, and Applications. New York: Academic, 1990.

[2] M. J. Narasimha and A. M. Peterson, “On the computation of the discrete cosine transform,” IEEE Trans. Commun., vol. COMM-26, pp. 934–936, June 1978.

[3] H. S. Hou, “A fast recursive algorithms for computing the discrete cosine transform,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 1455–1461, Oct. 1987.

[4] P. Lee and F.-Y. Huang, “Restructured recursive DCT and DST algo-rithms,” IEEE Trans. Signal Processing, vol. 42, pp. 1600–1609, July 1994.

[5] P. Duhamel and C. Guillemot, “Polynomial transform computation of 2-D DCT,” in Proc. ICASSP, Apr. 1990, pp. 1515–1518.

[6] E. Feig and S. Winograd, “Fast algorithms for the discrete cosine transform,” IEEE Trans. Signal Processing, vol. 40, pp. 2174–2193, Sept. 1992.

[7] N. I. Cho and S. U. Lee, “Fast algorithm and implementation of 2-D DCT,” IEEE Trans. Circuits Syst., vol. 38, pp. 297–305, Mar. 1991. [8] N. I. Cho, I. D. Yun, and S. U. Lee, “On the regular structure for the fast

2-D DCT algorithm,” IEEE Trans. Circuits Syst., vol. 40, pp. 259–266, Apr. 1993.

An Effective Memory Addressing Scheme for FFT Processors

Yutai Ma

Abstract—The memory organization of FFT processors is considered. The new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations. The advantage of this memory addressing scheme lies in the fact that it reduces the delay of address generation nearly by half compared to existing ones.

I. INTRODUCTION

Many high-speed FFT processors have been obtained by imple-menting the fast Fourier transform in pipelined digital hardware with a butterfly calculation unit, two-port data memory, ROM for storing twiddle factors, and memory addressing controller integrated on a chip. It is possible to use an in-place strategy that stores butterfly outputs in those memory locations used by the inputs to the same butterfly. The in-place strategy requires only a minimum amount of memory. For this reason, only the in-place radix-2 decimation-in-time version of the fast Fourier transform is considered here.

If the butterfly unit has parallel inputs and outputs, then the two butterfly inputs will be accessed in the memory, and two butterfly outputs will be written back to the same memory in each cycle. In order to avoid this memory bottleneck, the two-port memory module Manuscript received May 23, 1997; revised July 22, 1998. This work was supported by the National Key Project of Fundamental Research, P.R. China. The associate editor coordinating the review of this paper and approving it for publication was Dr. Elias S. Manolakos.

The author was with the Center for High Performance Computing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China. He is now with the Department of Electrical Engineering, Link¨oping University, Link¨oping, Sweden.

Publisher Item Identifier S 1053-587X(99)00736-9.

1053–587X/99$10.001999 IEEE