904 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999
A Refined Fast 2-D Discrete Cosine Transform Algorithm Yuh-Ming Huang and Ja-Ling Wu
Abstract—In this correspondence, an index permutation-based fast two-dimensional discrete cosine transform (2-D DCT) algorithm is presented. It is shown that theN 2 NN 2 NN 2 N 2-D DCT, where N = 2N = 2N = 2mmm, can be computed using onlyNNN 1-D DCT’s and some post additions.
Index Terms— Discrete cosine transform, fast algorithms, index per-mutation.
I. INTRODUCTION
The DCT is widely used in many digital signal processing ap-plications. Fast algorithms have been reported in the literature (i.e., [1]–[6]). Among those algorithms, [5] and [6] are believed to be the most efficient 2-D DCT algorithms in the sense of minimizing any measure of computational complexity.
Recently, Cho and Lee [7] have proposed a fast and modular DCT algorithm in which an (N 2 N)-point 2-D DCT could be obtained by computingN 2 N-point 1-D DCT’s and log2N + 1 butterfly stages. In a later work [8], they also provide regular expressions for the input–output relations of the post-addition stages. However, the number of required additions increases at the expense of improving the regularity in the structure.
Based on the idea of [7], in this correspondence, an index permutation-based fast 2-D DCT algorithm is proposed.
II. THEREFINEDFASTALGORITHM FORCOMPUTING THE2-D DCT For a given input data sequencefi; j; 0 i N 0 1; 0 j N 0 1, the denormalized 2-D DCT can be expressed as [1]
Ym; n= N01
i=0 N01
j=0
fi; j cos (2i + 1)m2N cos (2j + 1)n2N 0 m N 0 1; 0 n N 0 1: (1) After some permutation of the input data sequence [2], (1) can be written as Ym; n= N01 i=0 N01 j=0
Xi; jcos (4i + 3)m2N cos (4j + 3)n2N 0 m N 0 1; 0 n N 0 1 (2.1)
Manuscript received July 8, 1996; revised July 31, 1998. The associate editor coordinating the review of this paper and approving it for publication was Dr. Xiang-Gen Xia.
Y.-M. Huang is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C., and the Department of Information Engineering, National Chi-Nan University, Puli, Taiwan, R.O.C.
J.-L. Wu is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
Publisher Item Identifier S 1053-587X(99)01358-6.
where Xi; j= f2i+1; 2j+1 0 i; j N=2 0 1 f2i+1; 2N02j02 0 i N=2 0 1; N=2 j N 0 1 f2N02i02; 2j+1 N=2 i N 0 1; 0 j N=2 0 1 f2N02i02; 2N02j02 N=2 i; j N 0 1: (2.2)
Based on the idea of [7], let
Am; n= N01 i=0 N01 j=0 Xi; jcos (4i + 3)m + (4j + 3)n2N 0 m N 0 1; 0 n N 0 1 (3.1) and Bm; n= N01 i=0 N01 j=0 Xi; jcos (4i + 3)m 0 (4j + 3)n2N 0 m N 0 1; 0 n N 0 1: (3.2) Then Ym; n= (Am; n+ Bm; n)=2: (4) Since4j + 3 and N are coprime to each other, i.e., (4j + 3; N)1
= 1, the permutation (4j + 3)i + j modulo N maps all values of i. Letqij be the quotient of(4j + 3)i + j divided by N. Hence, the kernels of the 2-D transforms in (3.1) and (3.2) can be rewritten as 1-D DCT’s by replacingi with (4j + 3)i + j 0 Nqij. That is
Am; n= N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m + n) 0 4Nq2N ijm =N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m + n)2N 0 m N 0 1; 0 n N 0 1 (5.1) and Bm; n= N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m 0 n) 0 4Nq2N ijm = N01 i=0 N01 j=0 Xh(4j+3)i+ji ; j 1 cos (4j + 3)((4i + 1)m 0 n)2N 0 m N 0 1; 0 n N 0 1 (5.2) where hxiN denotes x modulo N.
For the simplicity of notation,Xh(4j+3)i+ji ; jis denoted asXi; j. Then, it can be seen that the 2-D input data sequence is grouped into
1(a; b) denotes the gcd of a and b.
1053–587X/99$10.001999 IEEE
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999 905 N distinct data sets of size N, that is, f(Xi; j; 0 j N 01); 0
i N 0 1g, and the equations N01 j=0 Xi; j cos (4j + 3)((4i + 1)m + n)2N (6.1) and N01 j=0 Xi; j cos (4j + 3)((4i + 1)m 0 n)2N (6.2) correspond to one of the 1-D DCT’s of the data sequence ^Xi; j or equal to zero with respect tom and n. That is, by defining
hil= N01
j=0
Xi; j cos (4j + 3)l2N (7) we can see that (6.1) and (6.2) correspond to one of+hil or0hil for somel = 0; 1; 1 1 1 ; N 0 1 or equal to zero. Besides, through an index permutation, (7) can be implemented by a 1-D DCT as
hil= N01 j=0 ^ Xi; jcos (2j + 1)l2N (8.1) where ^ Xi; j= XXi; (j01)=2 j: odd i; N010j=2 j: even. (8.2) Hence, for the computation of an(N 2 N)-point 2-D DCT, only the computation ofN 2N-point 1-D DCT’s and some post-additions are required.
Next, we will show that the post-addition stages can be imple-mented by a butterfly-like structure. Since
cos (4j + 3)((4(i + N=2) + 1)m + n)2N = 6 cos (4j + 3)((4i + 1)m + n)2N (9.1) and cos (4j + 3)((4(i + N=2) + 1)m 0 n) 2N = 6 cos (4j + 3)((4i + 1)m 0 n) 2N : (9.2)
Hence,Ym; ncan be expressed as in (10), shown at the bottom of the page. LetXi; j+X(i+N=2); j= Ri; j,Xi; j0X(i+N=2); j= Si; j.
For m even, since
cos (4j + 3)((4(i + N=4) + 1)m + n)2N
= 6 cos (4j + 3)((4i + 1)m + n)2N (11.1)
Fig. 1. Signal flow graph for 42 4 DCT.
and
cos (4j + 3)((4(i + N=4) + 1)m 0 n)2N
= 6 cos (4j + 3)((4i + 1)m 0 n)2N (11.2) Ym; n can be written as in (12), shown at the bottom of the page, where k = 0; 1; 1 1 1 ; N=4 0 1.
For example, ifN = 4, by (12), we have Y2; n= 12 3 j=0 (R0; j0 R1; j) cos (4j + 3)(2 + n)2N + cos (4j + 3)(2 0 n)2N : (13) Ym; n= 1 2 N=201 i=0 N01 j=0
(Xi; j+ X(i+N=2); j) cos (4j + 3)((4i + 1)m + n)2N + cos (4j + 3)((4i + 1)m 0 n)2N ; for m even 1 2 N=201 i=0 N01 j=0
(Xi; j0 X(i+N=2); j) cos (4j + 3)((4i + 1)m + n)2N + cos (4j + 3)((4i + 1)m 0 n)2N ; for m odd (10) Ym; n= 1 2 N=401 i=0 N01 j=0
(Ri; j+ R(i+N=4); j) cos (4j+3)((4i+1)m+n)2N +cos (4j+3)((4i+1)m0n)2N ; for m = 2(2k) 1 2 N=401 i=0 N01 j=0
(Ri; j0 R(i+N=4); j) cos (4j+3)((4i+1)m+n)2N +cos (4j+3)((4i+1)m 0 n)2N ; for m = 2(2k+1) (12)
906 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999
(a)
Fig. 2. Signal flow graph for 82 8 DCT. (a) First stage of the post-addition stages.
Form odd, let kil= N01 j=0 Si; j cos (4j + 3)l2N : (14) Since cos (4j + 3)((4(i + N=4) + 1)m + n)2N = 6 sin (4j + 3)((4i + 1)m + n)2N (15.1) and cos (4j + 3)((4(i + N=4) + 1)m 0 n)2N = 6 sin (4j + 3)((4i + 1)m 0 n) 2N : (15.2) However sin (4j + 3)l2N = 0 cos (4j + 3)(N 0 l)2N (16) that is, the 1-D discrete sine transform can be directly computed from the 1-D discrete cosine transform. Therefore, for somer and
(b)
Fig. 2. (Continued). Signal flow graph for 82 8 DCT. (b) Second, third, and fourth stages of the post-addition stages (forn is even).
s, 0 r; s N; Ym; ncan be written as shown in (17), shown at the bottom of the page, wherek = 0; 1; 1 1 1 ; N=4 0 1, and Girand His are, respectively, equal to6kir and 6kis.
For example, ifN = 4; r = 3, and s = 1, by (17), we have Y3; 2=12f(0k03+ k11) + (k01+ k13)g:
As a result, the computation of an(N 2 N)-point 2-D DCT can be achieved by recursively applying the above decompositions [(12) and (17)]. The signal flow graphs for a 4 2 4 and an 8 2 8 DCT are shown in Figs. 1 and 2, respectively.
Ym; n= 1 2
N=401 i=0
(Gir0 G(i+N=4)(N0r)) + (His+ H(i+N=4)(N0s)) form = 2(2k) + 1 1
2 N=401
i=0
(Gir+ G(i+N=4)(N0r)) + (His0 H(i+N=4)(N0s)) form = 2(2k + 1) + 1
(17)
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 3, MARCH 1999 907
(c)
Fig. 2. (Continued). Signal flow graph for 82 8 DCT. (c) Second, third, and fourth stages of the post-addition stages (forn is odd).
III. COMPLEXITYANALYSIS OF THEPOST-ADDITIONSTAGES For the post-addition stages, let A(N) and B(N), respectively, denote the number of all required additions and the number of additions required in the final stage, and letC(N) denote the number of nodes that do not require butterfly computations in the firstlog2N stages. From (13) and (16), we have C(4) = 2 and C(N) = C(N=2) + N=2 for N 8, and B(N) = N20 2N. Therefore, A(N) = N2log
2N 0C(N)+B(N) = N2(1+log2N)03N +2.
IV. CONCLUSIONS
An index-permutation based 2-D DCT algorithm has been pre-sented in this correspondence. The succinct derivation of the proposed algorithm makes it easier to describe the process of how to map one 2-D DCT into a number of 1-D DCT’s. From the idea of [8], a matrix-form-based systematic expression for the post-addition stages
in the proposed algorithm, which may improve the regularity of the structure, is currently under investigation.
REFERENCES
[1] R. K. Rao and P. Yip, Discrete Cosine Transform: Algorithm, Advan-tages, and Applications. New York: Academic, 1990.
[2] M. J. Narasimha and A. M. Peterson, “On the computation of the discrete cosine transform,” IEEE Trans. Commun., vol. COMM-26, pp. 934–936, June 1978.
[3] H. S. Hou, “A fast recursive algorithms for computing the discrete cosine transform,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 1455–1461, Oct. 1987.
[4] P. Lee and F.-Y. Huang, “Restructured recursive DCT and DST algo-rithms,” IEEE Trans. Signal Processing, vol. 42, pp. 1600–1609, July 1994.
[5] P. Duhamel and C. Guillemot, “Polynomial transform computation of 2-D DCT,” in Proc. ICASSP, Apr. 1990, pp. 1515–1518.
[6] E. Feig and S. Winograd, “Fast algorithms for the discrete cosine transform,” IEEE Trans. Signal Processing, vol. 40, pp. 2174–2193, Sept. 1992.
[7] N. I. Cho and S. U. Lee, “Fast algorithm and implementation of 2-D DCT,” IEEE Trans. Circuits Syst., vol. 38, pp. 297–305, Mar. 1991. [8] N. I. Cho, I. D. Yun, and S. U. Lee, “On the regular structure for the fast
2-D DCT algorithm,” IEEE Trans. Circuits Syst., vol. 40, pp. 259–266, Apr. 1993.
An Effective Memory Addressing Scheme for FFT Processors
Yutai Ma
Abstract—The memory organization of FFT processors is considered. The new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations. The advantage of this memory addressing scheme lies in the fact that it reduces the delay of address generation nearly by half compared to existing ones.
I. INTRODUCTION
Many high-speed FFT processors have been obtained by imple-menting the fast Fourier transform in pipelined digital hardware with a butterfly calculation unit, two-port data memory, ROM for storing twiddle factors, and memory addressing controller integrated on a chip. It is possible to use an in-place strategy that stores butterfly outputs in those memory locations used by the inputs to the same butterfly. The in-place strategy requires only a minimum amount of memory. For this reason, only the in-place radix-2 decimation-in-time version of the fast Fourier transform is considered here.
If the butterfly unit has parallel inputs and outputs, then the two butterfly inputs will be accessed in the memory, and two butterfly outputs will be written back to the same memory in each cycle. In order to avoid this memory bottleneck, the two-port memory module Manuscript received May 23, 1997; revised July 22, 1998. This work was supported by the National Key Project of Fundamental Research, P.R. China. The associate editor coordinating the review of this paper and approving it for publication was Dr. Elias S. Manolakos.
The author was with the Center for High Performance Computing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China. He is now with the Department of Electrical Engineering, Link¨oping University, Link¨oping, Sweden.
Publisher Item Identifier S 1053-587X(99)00736-9.
1053–587X/99$10.001999 IEEE