Recursive fast computation of the two-dimensional
discrete
cosine
transform
W.-H. Fang S.-K. Shih N.-C.Hu
Abstract: An efficient algorithm is presented for computing the two-dimensional discrete cosine transform (2-D DCT) whose size is a power of a prime. Based on a generalised 2-D to one- dimensional (1-D) index mapping scheme, the proposed algorithm decomposes the 2-D DCT outputs into three parts. The first part forms a 2- D DCT of a smaller size. The remaining outputs are further decomposed into two parts, depending on the summation of their indices. The latter two parts can be reformulated as a set of circular correlation (CC) or skew-circular correlation (SCC) matrix-vector products by utilising the recently addressed maximum coset decomp- osition. Such a decomposition procedure can be repetitively carried out for the 2-D DCT of the first part, resulting in a sequence of CC and SCC matrix-vector products of various sizes. Employing fast algorithms for the computation of these CCiSCC operations can substantially reduce the numbers of multiplications compared with those of the conventional row-column decomposition approach. In the special case where the data size is a power of two, the proposed algorithm can be further simplified, calling for computations comparable with those of previous works.
1 Introduction
The DCT has found applications in various facets of signal-processing applications, ranging from transform domain adaptive filtering and the design of filter banks to speech coding [l]. The 2-D DCT, in particular, has been widely considered to be the most effective scheme for transform coding. As such, various international standards, such as the MPEG-1,2 and H.261, have incorporated the 2-D DCT as the imageivideo coding ingredient block. To facilitate real-time implementa- tions, the development of efficient algorithms for com- puting the 2-D DCT has been of great interest over the past few decades.
The conventional approach for computing the 2-D DCT is based on the row-column decomDosition
0 IEE, 1999
IEE Proceedings online no. 19990021
DOI: 10.1049/ip-vis: 19990021
Paper first received 5th May and in revised form 26th October 1998 The authors are with the Department of Electronic Engineering, National Taiwan University of Science & Technology, Taipei, Taiwan, Republic of China
IEE Proc-Vis. Image Signal Process., Vol. 146. No. 1. February 1999
method, which first determines the row (column) trans- forms of the input 2-D data, then transposes these intermediate results, and finally takes the column (row) transforms of the transposed data.
More efficient algorithms can, in general, be attained by working directly on the 2-D data by fully exploiting the 2-D characteristic of the kernel function. For exam- ple, Cho and Lee addressed a fast algorithm which decomposes an N x N DCT into N 1-D N-point DCTs [2], in contrast to the 2N N-point DCTs required for the row-column approach. Duhamel and Guillemot considered an efficient polynomial transform-based approach [3, 41. Wu and Paoloui proposed a recursive 2-D DCT algorithm [5] that extended Hou's work [6]. Kung and Lee considered a convolution-based algo- rithm that admits efficient hardware implementations [7]. Despite their efficiency, these fast algorithms, how- ever, are only applicable to the 2-D DCT of size 2' x 2'.
In this paper, we address a new fast algorithm for the recursive computation of the 2-D DCT of size p' x p', where p is a prime. The essence of the new algorithm is based on the 2-D to 1-D index mapping scheme, as addressed in [8]. This index mapping scheme has suc- cessfully been applied to the fast computation of the 2- D generalised discrete Hartley transform. A similar technique was also utilised to compute the 2-D discrete Fourier transform using the discrete Randon transform (or linear congruence) [9]. Based on this index mapping scheme, the proposed algorithm decomposes the 2-D DCT output components into three parts. In the first part, the output components whose indices are multi- ples of p form a pr-' x pr-' DCT. The remaining output components are further decomposed into two parts, depending on the summation of their indices. The latter two parts can be reformulated as a set of CC or SCC matrix-vector products by utilising the recently pro- posed maximum coset decomposition [lo, 111. Such a decomposition procedure can be repetitively carried out for the 2-D DCT of the first part by following the same steps. As a consequence, the Computation of the 2-D DCT is converted into a sequence of CC and SCC matrix-vector products of various sizes. Employing fast algorithms for the computation of these CCiSCC oper- ations, we can thus substantially reduce the numbers of multiplications required, compared with those of the row-column decomposition approach. In the special case when p = 2, the proposed algorithm can be further simplified, calling for computations comparable with those of previous works.
2 Fast algorithms for computing the 2-D DCT
For a set of 2-D data
{x(k,,
k2); 0<
kl I N - 1, 0 I k2 5 N - 1 }, the 2-D DCT is defined asx cos
4N
(1) for 0 I n l , n2, 2 N - 1, where the scale lactor, without
loss of generality, has been left out for notational brev- ity, and N = pr with p being a prime. Invoking the trig- onometric identity of cos(a) cos(p) = 'i2[cos(a
+
p)+
cos(a - p)], we can rewrite eqn. 1 as
1 X ( n 1 , n z ) = 5 [ X + ( n z , n 2 ) + x - ( n l , n 2 ) ] (2) where k1=0 k 2 = 0 cos (27r[(2kl
+
1)nl+
(2ka+
l ) n 2 ]1
cos ( 2 7 r P k l+
1)nl - (2k2+
l ) n 2 ] 4N (3) and N - 1 N - 1 X - ( n l , n z ) = x ( h , b ) k l = O k z = O 4N (4) Using the 2-D to I-D index mapping equations of (2k1+
1)nl f (2k2+
l ) n 2 E (2n+
3)25 mod 4N (5)( 2 k l + l ) n l f ( 2 k 2 + 1 ) n a E (2n+1)(25+1) mod 4N otherwise (6)
Xi-(nl, n2) and X-(nl, n2) can then be evaluated, respec- tively, using the following I-D expressions:
181, if n1
+
n2 is even X+(YL1,722)6
Y + @ ) anda
X ' - ( n 1 , n 2 ) = Y - ( n ) otherwise ( 7 ) if n1+
n2 is even -- I
otherwise ( 8 ) where y'(k) = X k 2 x(k,, k2), in which (kl, k 2 ) satis- fies the index mapping equations of eqn. 5 or eqn. 6,26
depending on whether n1
+
n2 is even or not. Thereby,X ( n l , n2) can now be rewritten as n X'(n1,nz) = Y ( n ) if n1
+
122 is even( s a )
- =1
7-
1 y(k) cos(
2 Q n + l l ( 2 k + 1 ) ) k = O (9b) otherwiseI
where Y(n) 1/2[Y+(n)
+
Y-(n)] and y ( k )k
1 / 2 b C ( k )+
y-(k)]. Next, we address the issue of solving the 2-D toI-D index mapping equations (eqns. 5 and 6). To solve these, n l , n2 and n must be decided first. In other words, the index ( n l , n2) of the 2-D output component is first mapped to a predetermined I-D output compo- nent with an index n, denoted by (nl, n2)
+
n. Accord- ing to this (nl, n2)+
12, a solution set of ( k l , k 2 ) on theleft-hand side of eqns. 5 and 6 can thus be obtained for every specific k on the right-hand sides of eqns. 5 and 6. y ( k ) can then be determined by a summation of x ( k l ,
k2) of these corresponding ( k l , k2)s.
To mitigate the overhead in the computation of y(k),
we consider those { X ( n l , n 2 ) } that share the same set of
{ y ( k ) } together. As all of the corresponding (nl, n2)s
satisfy the congruence eqns. 5 and 6, they are related to each other by some factors. Therefore, if one of these indices, say ( n ? , n ; ) ,
+
0, then the other indices ( n l ,n2)s inside this set can be determined as a function of ( n ? , n ; ) and n. Such a particular (n!, n ! ) is thus referred to as a dominant point for the corresponding set of { X ( n , , n 2 ) } . Then, as shown in the Appendix, it follows that the rest of { X ( n l , n 2 ) } , which share the same set of { y ( k ) } , are related to Y(n), n = 1, 2, ..., ( N - 1)/2 ~ 1, by
Y(72)
=x
( ( ( 2 n+
l ) n ? ) , ( ( a n+
1)ni))
(10) where ( . ) stands for either mod 4N or mod 2N, depending on whether eqn. 5 or eqn. 6 is employed. Note that, as both 4N and 2N are even for all N , it is easy to justify that, if a dominant point ( n y , n ! ) has a summation of indices ny+
1202 that is even(odd), thenthe corresponding { X ( n l , n 2)} has summations of indi- ces n l
+
n2 that are all even(odd).Some output components, however, appear repeti- tively for different dominant points, if (2n
+
1 ) is a multiple of p . For example, if N = 9 and for the domi- nant point (1, 1)-+
0, we obtain a corresponding out- put component set as Y(0) = X(1, l), Y(1) = X ( 3 , 3),Y(2) = X ( 5 , 5 ) and Y(3) = X(7, 7). On the other hand, the dominant point (1, 5) 4 0 will generate another output component set as Y(0) = X(1, 5 ) , Y(1) = X ( 3 , 3),
Y(2) =
X(5, 7),
and Y(3) =X(7,
1). We can note thatX ( 3 , 3) is repetitively computed, thus increasing the computational load. To avoid this redundant coniputa- tion, n that satisfies (2n
+
1) = mp, or equivalently, when both indices of the outputs are multiples of p owing to eqn. 10, will be treated separately. Given this fact, along with the index mapping eqns. 5 and 6, we can classify the computation of the 2-D DCT in the following three categories, according to the output indices:Part I : {X'(nl, n2): n1 m o d p = n2 m o d p = 0}
Purt 2: { X ( n l , n2): n l mod p # 0 or n2 mod p # 0, n l
+
n2 is even)
Part 3: { X ( n l , n2): n1 mod p f 0 or n2 mod p f 0, n1
+
n2 is odd)
For the output components of part 2, using the max- imum coset decomposition as addressed in [lo, 111, the corresponding 2n
+
1 (2n+
1 # mp) can be obtained by a generator that is an odd prime. More specifically, such a set of numbers can be generated by a power of g mod 2N as GI = (go, g ' , ..., ga-'} mod 2N, where g is the coset generator with orderA
that satisfies cos(2d 4N) = f cos(2ngr/4N) Therefore, for every dominant point, there are onlyA
(non-redundant) corresponding Y(n)s (or, equivalently, X ( n l , n2)s), whereA
= pr-' (p -1)/2, if N = p' [12]. Also, the (N
+
1)/2 input points in eqn. 9a can be divided into the following cosets: GI, G2 = pG1 = (p, pg,...,
pgmp'pl} mod 2N,...,
G, = pr-'G1mod 2N, and G,,, = {N}. The corresponding orders of these cosets are
A,
Alp, ..., and 1, respectively, and their summation is equal toA
+
Alp+
...
+
1 = (N+
1)/2.By replacing (2n
+
1) and k with f and p'gl, respec- tively, in eqn. 9a, we can rewrite eqn. 9a with ( N+
1)/ 2 inputs andA
outputs as((v))
(11) where m = 0, 1,
..., A
- 1, and ( . ) denotes mod 2Noperation. Expanding eqn. 11 yields the following more illustrative matrix notation:
=c
i = O
r
X
where C,
6
cos(2nk/2N) and we have used the fact thatCpigUp'+m = C i n, i = 0 ,
...,
r, m = 0 , ..., A - 1. AS such,the computation of the 2-D DCT output components in part 2 can be implemented by a sequence of CC matrix-vector products of sizes Alpi, i = 0,
...,
r .Note that, as 0 I ( g" ) mod 2N, some rectifications are made for the y(k) in eqn. 9a. If ( pign7 ) < (N
+
1)/2 then we simply use y(( pig" )); on the other hand, if (pig") 2 (N+
1)/2 we will use -y(N - ( pig" )), as read-ily justified using eqns. 5 and 6. Similarly, if ( g' ) < N, the output of eqn. 9b is Y(( (g' - 1)/2 )); otherwise, the
output will become -Y(n), where ( g i ) = 2N - (2n
+
1).( 2
IEE Proc.-Vis. Image Signal Proress.. Vol. 146. No. I , February 1999
Along the same line, the computation of the output components of part 3 can be simplified by following the above steps, except that the operation of mod 2N is replaced with mod 4N. More specifically, for every dominant point, the corresponding 2n
+
1 can be gen- erated by a generator g mod 4N as GI = (go, g', ..., ga-l} mod 4N, where g is the coset generator with orderA.
The ( N - 1)/2 input points in eqn. 9b can thus bedivided into the following cosets: GI, G2 = pG1 = (p, pg,
...,
pgUp-'} mod 4N,..,
and G, = p"GI mod 4N. The corresponding orders of these cosets areA, ...,
andAlW-l),
respectively, and their summation is equal toA
By replacing (2n
+
1) and (2k+
1) with gm and p'g', respectively, in eqn. 9b, we can rewrite eqn. 9b with( N - 1)/2 inputs and
A
outputs as+
Alp+
...+
Alpr-l = ( N - 1)/2.7 7
f
/ g m - 1 \ \(13) where m = 0, 1,
..., A
- 1 and ( . ) denotes mod 4Noperation. As above. we can also exmess ean. 13 as the following matrix notation:
X
...
. . .
. . .
. . . (14) whereck
& cos((2&)/4N), - and we have used the factthat
CppP1trn
= - C p l p , i = 0 ,...,
r , rn = 0, ..., A - 1. Assuch, the computations of the 2-D DCT outputs in part
3 can be implemented by a sequence of SCC matrix- vector products of sizes Alpi, i = 0 ,
...,
r.Likewise, the input and output indices are required to be relocated if they are outside the range. If ( ((pigm)
- 1)/2 ) < (N - 1)/2, then the input is y(((pig") - 1)/2);
on the other hand, if ( pig" ) 2 (N - 1)/2, we will use
the input as -y(N - ((pig"))/2) in eqn. 9b. Also, if ( (g' -
1)/2 ) < N, the output of eqn. 9b is Y(( (g' - 1)/2 ));
otherwise, the output will become -Y(n) or Y(n), depending on whether ( g' ) can be expressed as 2N f (212
+
1) or 4 N - (2n+
1).From the above discussion, we can note that the algorithm initially requires (N2(P2 - l))/p2il dominant
points that, in turn, generate a total of N2(1 l/p2) out- put points. Such a decomposition procedure can be recursively carried out for the 2-D DCT of part 1. The overall steps can be summarised as the following algo- ri thm for computing the 2-D DCT of size p r x p r (with initial value i = 1):
Srep I : Divide the output components into three parts, according to their indices. If both of the indices nl and n2 are multiples of p (part l), these {X(nl, n2))
form a 2-D DCT of size pr-' x pr-'. The rest of the out- put components can be further classified into two parts, part 2 and part 3, depending on whether the summation of nI/@'-l)
+
n2/(pi-') is even or odd, respec- tively.Step 2: The { X ( n l , n2)} in parts 2 and 3 can be fur- ther classified into several groups with different choices of dominant points. Every group of {X(nl, n2)} in parts 2 and 3 can be implemented using the CC operation of eqn. 12 and the SCC operation of eqn. 14, respec- tively. The resulting output components can be obtained by using the index mapping scheme of eqn. IO. The pr-' x pr-' DCT of part 1 can be further decomposed along the same steps by going back to step 1, with i = i
+
1 until i = Y.As a special case when p = 2, eqn. 9a and b can be simplified as
otherwise
(15) Employing the maximum coset decomposition as dis- cussed above, we can obtain
I
(
(
?
)
)
if n;+
722 = even (16) and 27rpzg A - 1k'
((
?))
=CY
( ( P 2 S r n ) ) cos(
4 N)
1 =O otherwise (17)
where m = 0,
...,
A
- 1, and (.
) denotes mod2N
and411' in eqns. 16 and 17, respectively. As above, eqns. 16 and 17 can be implemented using a sequence of SCC matrix-vector products. These SCC operations can be
r
efficiently implemented via the algorithm of [IO], with
y ( k ) as the input and the odd-indexed output compo- nents as the corresponding output.
To follow, we consider two examples to reinforce the proposed algorithm.
Example 1 ( 5 x 5 DCT): As N is equal to 5 (p = 5
and Y = 1) in this example, there are N2(1 - lip2) = 24
output components to be determined in the first itera- tion. If we choose the generator g = 3, the correspond- ing order il can be easily found to be 2. This implies that a total of 12 dominant points, which can be (arbi- trarily) chosen as (0, l), (1, 0), (1, 2), ( I , 4), (4, I), (4, 31, (0, 21, (1, I), (1, 31, (2, 01, (2, 2) and (2, 41, are required. Then, we determine the corresponding { y ( k ) } for the above dominant points by using the index map- ping schemes of eqns. 5 and 6.
For the dominant points (0, 2), (1, l), (1, 3), (2, 0), (2, 2) and (2, 4), as the index summations are even, eqn. 5 is employed. As an example, for the dominant point (1, l), y ( k ) = 1 / 2 b + ( k )
+
y-(k)], k = 0, 1, 2, wherey+(k) and y-(k) can be determined using eqn. 5 as fol- lows: y'(0) = - .r(0,4) - x(1,3) - ~ ( 2 . 2 ) - ~ ( 3 . 1 ) - X(4,O)
+
J44,4) - X ( 2 , l ) - 5(2,3) - 2(3,0) - 4 3 . 2 ) - 2(4,1)+
2 ( 4 , 4 )+
Z(2,l)+
2(2,3)+
2(3,2)+
2(3,4) - 2(4,0)+
z ( 4 , 3 ) y + ( 2 ) =+
Z ( 0 , l ) - z ( 0 , 2 )+
Z(1,O) - z ( 1 , l ) - 4 2 , O ) - 2(2,4) - 2 ( 3 , 3 )+
4 3 . 4 ) - 4 4 . 2 )+
2(4,3)+
4 2 . 0 )+
c ( 2 , 4 ) - z(3,O)+
z ( 3 , l ) - 4 4 . 1 )+
2(4,2) y-(0) = + 5 ( 0 , 0 ) + z ( l , l ) + 2 ( 2 , 2 ) +2(3.3) y'(1) = + z ( O , O ) - x ( 0 , 3 ) -.r(1,2) - z ( 1 . 4 ) ~ ~ ( 1 ) =+
~ ( 0 , l ) - 2(0,4)+
z(1,O)+
~ ( 1 , 2 ) ~ ~ ( 2 ) = + ~ ( 0 . 2 ) -~c(O,3) + ~ ( l , 3 ) - ~ ( 1 , 4 )An illuminative picture to demonstrate the above computation is shown in Fig. 1. For every one of these dominant points, the corresponding { Y( n)}, which shares the same { y ( k ) } , can be determined using the CC matrix-vector product of eqn. 12, as follows:
I.'(O) cos+
[
y(1)]
[1'(1)1 = c o s 5 4 2 )
+
[cos%
I
[-Y (0)l (18) The resulting { X ( n l , n2)} can then be determined using eqn. IO. Following the above for the dominantpoint (1, l), we obtain
Y ( 0 ) =
X(n(:,ni>
= X(1,l) Y ( l ) =x
( ( 3 4 , ( 3 n i ) ) = X ( 3 , 3 )Similarly, for the dominant points (0, l), (1, 0), (1, 2), (1, 4), (4, 1) and (4, 3), eqn. 6 should be employed to determine { y ( k ) } . For every one of these dominant points, the corresponding { Y(n)}, which shares the
same { y ( k ) } , can be determined using the SCC matrix- vector product of eqn. 14, as follows:
(19) The resulting outputs {X(nl, n2)) can then be obtained using the index mapping eqn. 10. In the
second iteration, there is only one output component, i.e. X(0, 0), that needs to be determinedd-,which can be easily determined by X(0, 0) = CC:,, C k , = ,
x(k,,
k2). Abutterfly-like structure depicting the above computing procedure is shown in Fig. 2 .
Example 2 (4 x 4 DCT): As
N
is equal to 4 (p = 2and r = 2) in this example, there are N2(1 - l/(p2)) = 12 output components to be determined in the first itera- tion. If we also choose the generator g = 3, the corre- sponding order
d
can be easily found to be 2. This implies that a total of six dominant points, which can be (arbitrarily) chosen as (0, l), (1, 0), (1, 2), (2, l), (1, 1) and (1, 3 ) , are required.For the dominant points (1, 1) and (1, 3), the index summations of the associated { X ( n l , n?)} are even, and the corresponding { Y ( M ) } can be determined using the
CC matrix-vector product of eqn. 12, as follows:
whereas, for the dominant points (0, I), (1, 0), (1, 2) and (2, l), the index summations of the associated f X ( n , , nz)} are odd and the corresponding { Y(n)} can be determined using the SCC matrix-vector product of eqn. 14, as follows
In the second iteration, three output components, namely X(0, 2), X(2, 0) and X(2, 2), can be determined by
[Y(O)I
= Y ( 1 ) (22)(23) and
[Y(O)I
= [cosy
1 [!@)I
In the third iteration, X(0, 0) can be easily obtained by the summation of all inputs
{x(k,,
k 2 ) } . A butterfly- like structure depicting the above computing procedure is shown in Fig. 3 . ~ ( 0 . 0 ) - x(3,O) xiO.3)- x(3,3) x (1 ,I 1 - x ( 2 3 1 x(1,2)- x12,2) xl0,l)- ~ ( 3 . 1 ) xi0,2) ~ ( 3 . 2 ) x(1.0)- X(2,O) ~ ( 1 , 3 ) - xi2,3) x(O,O)+x(3,0) x(0,3)* x (3,3) xil,l)* x (231 xi1,2)* x(2,2) x(1,Ob x (2.01 x(1,3)* x(2,3) xlO,l)* ~ 1 3 . 1 ) x(0,2)+ x (3,21 Xi1,O) X(3.0) X(1,2) X ( 1 , I ) x (3.3) -X(3,21 X(1,3) -X(3,1) X ( 2 , I ) -X (2.3) x (0.2) X(2.0) X(0,O) X(2,2)=
. . . Fig.?algorithm Structure for computing the 4 x 4 DCT by using proposed fast For the computation
of
the2-D
inverseDCT
(IDCT), the fast algorithm addressed above can also be employed, with slight modifications. The 2-D IDCT for the transformed data { X ( n l , n2)} is given by
(24) for 0 5 k l , k2, 5 N - 1, where the scale factor has again
been left out for brevity, and N = p‘ with p being a
30
prime. lJsing the same trigonometric identity, eqn. 24 can be rewritten as 1 N - l N-l 4 k l . k 2 ) =
5
X ( k 1 , k a ) n1=0 n2=0)
)
27r[ ( a h
+
l ) n 1+
(2k2+
l ) n 2 ](
4N x cos 1 IV--l N-l+
5
X ( k l , k 2 ) n1=0 n2=0 27r [(2k1+
1)nl - (2k2+
l)n.L] 4N (25)(
x cosUsing a modified 2-D to I-D index mapping equa-
( 2 k l
+
1)nl f ( 2 k 2+
l)nz ( 2 k+
1)2n mod 4 N tion, as follows:(26) we can re-express eqn. 25 as
z ( h , k 2 ) = y(k) A‘-1 = Y1(n)cos n=O N-1
+
Yz(n) cos n=O (27) where Y1(21 = Cn2 X ( n l , nz) in which ( n , , n2)s satisfy the index mapping equation of eqn. 26 with plus(mi- nus) sign. Employing the maximum coset decomposi- tion scheme addressed above, eqn. 27 can be rewritten as((V))
(28) Expanding eqn. 28 into a matrix form, we can note that the computation of { y ( k ) } is equal to a summation of CC and SCC matrix-vector products, such as those of eqns. 12 and 14, respectively. In a similar manner, the resulting
{x(k,,
k 2 ) } can then be obtained from { y ( k ) } using the index mapping, as follows:>-
( ( 2 k+
1)(2kY+
1)) - 1Y@)
= x((
2>>
( ( ( 2 k+
1)(2k!+
1)) - 1 2 (29) where (k’i, k ! ) is the chosen dominant point, and ( denotes mod 2N operation.3 Comparison and discussion
In this Section, we compare the computational com- plexity of the proposed fast algorithm with that of existing ones. Because the algorithm recursively )
computes the 2-D DCT of size N x N ( N = p r ) , it is straightforward to justify that the numbers of multipli- cations MUL(N) and additions ADD(N) are
x
r-i-I j = O A - - ' I - ' \(
p r - lI
p r - j - l J J + N ( 3 + : ) - 2 (31) respectively, where Mcc(m) (Acc(m)) and Mscc(m) (Ascc(m)) stand for the numbers of multiplications (additions) required for the CC and SCC matrix-vector products of size m, respectively.Table 1: Comparison of row-column decomposition approach with proposed one for computing
Algorithms Complexity 3 x 3 5 x 5 7 x 7 9 x 9 Row-column number of 18 50 112 180 multiplications number of 24 130 420 612 additions Proposed number of 8 30 64 80 multiplications number of 27 169 523 1059 additions
In the special case where p = 2, the recursive equa- tions for computing the required numbers of multipli- cations and additions can be simplified, respectively, as
(32) and T - 2 I N 3N
N
ADD(N)
=l5
(=log, - - - z=o;)
4 x 2 2 log, 2 j+
- 4 N 3 x 2J j=o+-
2 x 2 2(E
4
r - I r - 1 +N2(log2 N+
1)+
c(2')' - 2pz+
4 2=2 z = 1 (33) Table 1 provides the numbers of multiplications and additions required for computing the p r x p r (JJ is anodd prime) DCT using the proposed fast algorithm and the row-column decomposition approach, whereas other existing fast algorithms are only applicable to p = 2. The 1-D DCT algorithms used in the row-column decomposition approach are based on the ones of [6], when p = 2, and of [13], when p is an odd prime, whereas the CCiSCC matrix-vector products employ the algorithms of [14]. In Table 2, we also compare the computational complexity with other existing fast algo- rithms when p = 2.
As we can observe from Table 1, the proposed algo- rithm requires fewer numbers of multiplications (at the price of more additions), compared with those of the row-column decomposition approach. In the special case when p = 2, we can note from Table 2 that the proposed fast algorithm calls for the same numbers of multiplications and roughly the same numbers of addi- tions as those of [2, 41, which are known to be by far the most parsimonious approaches. As
a
whole, the proposed fast algorithm possesses the following advan- tageous features:(a) The new algorithm is more general than the previ- ous ones in that it is also applicable to the 2-D DCT of size p" x p r , where p is any prime. Although the current standards are based on the 2-D DCT of power-of-two data size, it has been observed that using a 2-D DCT of general data size can reduce the annoying blocking effect in the DCT-based image coding problems; see, for example, [15] for a discussion on lapped transforms.
(b) In contrast to the row-column decomposition
approach, the new one does not require the transposi- tion manipulation. In addition, the outputs correspond- ing to every dominant point can be computed independently, thus allowing for efficient parallelipipe- lined implementations.
(c) The 2-D to I-D index mapping scheme employed can be readily extended to the higher-dimensional case. In contrast, the previous ones, as addressed in [2, 161, are specifically designed for the 2-D DCT and do not allow for any extension.
Table 2: Comparison of various fast algorithms for computing 2' x 2' DCT
Algorithms Row-column [31 I21 [I61 Proposed
Data size Mult. Add. Mult. Add. Mult. Add. Mult. Add. Mult. Add.
4 x 4 32 74 16 68 16 74 16 74 16 74 8 x 8 194 464 96 484 96 466 112 472 96 466 1 6 x 16 1024 2592 512 2531 512 2530 640 2624 512 2530 32 x 32 5120 13376 2560 12578 2560 12738 3328 13504 2560 12738 64 x 64 24576 65664 12288 60578 12288 61314 16384 66174 12288 61314 ~~ ~ ~~ ~~ ~
4 Conclusions
In this paper, we propose a new fast algorithm for the recursive computation of a p r x p r DCT, where p is a prime. The algorithm recursively decomposes the out- put components into three parts according to their indices. Such a classification of outputs can, in turn, be reformulated as a set of CC and SCC matrix-vector products of various sizes. Utilising existing fast algo- rithms for the computation of these CC and SCC matrix-vector products, we can thus obtain an algo- rithm with minimum multiplicative complexity. A simi- lar fast algorithm for computing the 2-D IDCT is addressed as well.
5 Acknowledgments
This research was supported by the National Science Council of the Republic of China, under contract NSC87-2213-E-011-014. The authors would like to thank the reviewers, whose comments have enhanced the quality and readability of this paper.
6 1 2 3 4 5 6 7 8 9 References
RAO, K.R., and YIP, R.: ‘Discrete cosine transform’ (Academic Press, San Diego, CA, 1990)
CHO, N.I., and LEE, S.U.: ‘Fast algorithm and implementation of 2-D discrete cosine transform’, IE EE Trans. Circuits Sysf., 1991, 38, (3), pp. 297-305
DUHAMEL, P., and GUILLEMOT, C.: ‘Polynomial transform conmutation of the 2-D DCT’. Proceedines of IEEE international confkrence on Acoust., speech, signal uprocessing, ICASSP90, Albuquerqe, NM, April 1990, pp. 1515-1518
DUHAMEL, P., and GUILLEMOT, C.: ‘A polynomial-trans- form based computation of the 2-D DCT with minimum multipli- cative complexity’. Proceedings of IEEE international conference on Acoust.. speech, signal processing, ICASSP’96, Atlanta, GA, May 1996, pp. 1347-1350
WU, H.R., and PAOLOUI, F.J.: ‘A two-dimensional fast cosine transform algorithm based on Hou’s approach’, IEEE Trans. Sig-
nal Process.. 1991, 39, (2), pp. 544546
HOU, H.S.: ‘A fast recursive algorithm for computing the dis-
crete cosine transform’, IEEE Trans. Acoust. Speech. Signal Proc- ess., 1987, 35, (IO), pp. 1455-1461
KUNG, S.S., and LEE, M.H.: ‘An expanded 2-D DCT algorithm based on convolution’, IEEE Trans. Consum. Electron.. 1993, 39, ( 3 ) , pp. 159-165
HU, N.-C., and LU, F.-F.: ‘Fast computation of the two-dimen- sional generalized Hartley transforms’, I E E Proc. Vis. Image Sig- nal Process., 1995, 142, ( I ) , pp. 35-39
GERTNER, I.: ‘A new efficient algorithm to compute the two- dimensional discrete Fourier transform’, IEEE Truns. Acoust.
Smech Sipnal Process., 1988, 36. (7). DP. 1036-1050
10 HU, N.-e.., and LIN, K.-C.: ‘Skew~~rcular/circular correlation decomposition of prime-factor DCT’, IEE Proc. Vis. Image Signal Process.. 1995, 142, (4), pp. 241-246
11 HU, N.-C., and LOUH, S.-W.: ‘Two-stage decomposition of the DCT’, IEE Proc. Vis. Image Signal Process., 1995, 142, ( 5 ) , pp.
3 19--326
12 McCLELLAN, J.H., and RADER, C.M.: ‘Number theory in digital signal processing’ (Prentice-Hall, Englewood Cliffs, NJ, 1979)
13 HEIDEMAN, M.T.: ‘Computation of an odd-length DCT from a
real-valued DFT of the same length’, IEEE Trans. Signal Proc-
ess., !992, 40, (l), pp. 5 4 6 1
14 NUSSBAUMER, H.H.: ‘Fast Fourier transform and convolution algorithms’ (Springer-Verlag, New York, 1982)
15 MALVER, H.S.: ‘Signal processing with lapped transforms’ (Artech House, Norwood, MA, 1992)
16 CHO, N.I., and LEE, S.U.: ‘A fast 4 x 4 algorithm for the recur- sive 2-D DCT’, IEEE Trans. Signal Process., 1992, 40, (9), pp. 216G2173
7 Appendix
In this Appendix, we show the index mapping eqn. 10. For brevity, we only prove
(2kl
+
l)nl+
(252+
l)n2 E (an+
1)2k mod 4 N if n1+
122 even (34)32
( 2 k l + l ) n 1 + ( 2 k 2 + l ) n 2
=
( 2 n + l ) ( 2 k + l ) mod 4 N otherwise (35) The negative case of eqn. 10 follows straightfor- wardly. We begin our discussion with eqn. 34. Recall that the solution of a linear diophantine equation a l x l+
4 x 2 == b can be expressed by the following paramet-ric expression: ( 3 6 ) a2 d
q ( t )
= q ( 0 )+
-t
a1 d Q ( t ) = 2 2 ( 0 ) --t
( 3 7 )where xI(0) and x2(0) are particular solutions, d is the greatest common divider (GCD) of al and a*, and t is an integer. As ( n ; ,
n i )
is now a dominant point with n = 0, from eqn. 5 the indices k l , k2 and k need to satisfy the following expression:( 3 8 ) A particular solution k l ( 0 ) and k2(0) to eqn. 38 can be easily determined by choosing k l ( 0 ) = 0 and then solving eqn. 38 for k2(0), as follows:
(251
+
l)n:+
(2k2+
l ) n i = 25 mod 4 Nwhere I is an integer. Employing eqns. 36 and 37, we can then obtain the general solution of eqn. 34 as
2 4 d k l ( t ) =
-t
2k - ny -n:
+
4 N l 2 4 2nY d --t
( 4 1 ) k 2 ( t ) =where d = GCD(2n7, 2n!), and we have used the par- ticular solution k2(0) of eqn. 39. Plugging the solutions k l ( t ) and k2(t) of eqns. 40 and 41, respectively, in eqn. 34 and comparing both sides, we can obtain
( 2 n l ) ( 2 n i ) t - ( 2 n 2 ) ( 2 n ? ) t =
o
( 4 2 ) andn?
-( 4 3 ) It then follows that, for the outputs whose indices ( n l ,
n2)s satisfy
( 4 4 )
122 ( 2 n
+
l)n: mod 2 N ( 4 5 ) n1 z (an+
l)ny mod 2 Nand
then these ( n l , n2)s share the same { y ( k ) } .
Similarly, when n1
+
n2 is odd, eqn. 35 is employed. We can also (arbitrarily) choose a particular solution askl(0) = 0 and k2(0) = (2k
+
1 - ny ~ n!+
4Nr>/2n;,where I is an integer. Then, we can determine the gen- eral solution to eqn. 35 as
an: d k , ( t ) =
-t
and 2k - ny -ni
+
4N1 2nf --t
( 4 6 ) 2 4 d kZ(t) =Substituting eqn. 46 into eqn. 35 and equating both sides yields
( 2 n l ) ( 2 n i ) t - ( 2 n 2 ) ( 2 n y ) t =
o
( 4 7 )(49)
and n1 G
(2n
+
1)n: mod 2NIt then follows that, for the outputs whose indices then these (n,, n2)s share the same { y ( k ) } . Thus it com- pletes the proof.
(n,, n2)s satisfy