Two-stage circular-convolution-algorithm/architecture for the discrete cosine transform

(1)

Two-stage circular-convolution4 ike

algorithm/architecture for the discrete cosine

transform

W.-J. Duh J.-L.

W U

Indexing terms: Signal processing, Algorithms, Transforms

Abstract: Because of the great improvements in computer engineering, digital signal processing has been getting more and more important in recent years. Since discrete transformations play a significant role in digital signal processing, they have found many applications in various fields. The discrete cosine transform (DCT) is well known for its usefulness in the fields of image processing and data compression. With recent advances in the ISDN, limited communication bandwidth has become a new bottleneck, a pos- sible solution to which may be an efficient encod- ing algorithm/architecture. The paper presents a two-stage algorithm and its corresponding architectures for efficient computation of a power-oftwo length DCT. In this approach, the transform matrix of the DCT is decomposed into the product of two matrices, the preprocessing and the postprocessing ones. The elements in the preprocessing stage consist of 1, - 1, and 0 only; the postprocessing stage is of block diagonal form in which each block performs a circular-convolution- like (CCL) operation. Thus, both stages can be implemented efficiently either by software or hardware. Details of the matrix decomposition are described and several corresponding architectures are also presented.

1 Introduction

Since its introduction the discrete cosine transform (DCT) [l] has found a number of applications in image processing [2] and in speech processing [3]. It has also been shown that the DCT has a performance very close to the statistically optimal Karhunen-Ldve transform (KLT) for a large number of signal classes [4, 51. There- fore, many fast algorithms for computing the DCT have been derived. A comprehensive review of various fast DCT algorithms was given in Reference 6 and a few novel recursive DCT algorithms were presented in Refer- ences 7 and 8. Since they are mostly used in various signal-processing applications, DCT processors with real- time computation capabilities are urgently required. DCT architectures have been proposed to meet this requirement, such as the distributed arithmetics DCT of Sun er al. [SI, the concurrent DCT [lo] and the constant-rotation DCT of the authors [I 13.

Paper 7514F (E5), first received 19th December 1989 and in revised form 24th April 1990

The authors are with the Department of Computer Saence and Infor- mation Engineering, National Taiwan University, Taipei, 10764, Taiwan, Republic of China

IEE PROCEEDINGS, Vol. 137, P t . F , N o . 6, DECEMBER 1990

In this paper, a two-stage algorithm/architecture for computing a power-of-two length DCT is proposed. The preprocessing stage consists of values 1, - 1, and 0. The postprocessing stage is of block diagonal form with maximum block size N/2. Each block is of circular- convolution-like (CCL) form. The difference between the circular-convolution and the CCL lies in the sign of the elements. The postprocessing stage is totally parallel among blocks and can be realised by using the modified contemporary circular-convolution architecture [ 12, 131 within each block. Thus, both the pre- and postprocessing stages can be implemented efficiently. The system block diagram of the proposed algorithm/architecture is shown in Fig. 1.

input dato

Fig. 1 System block diagram oftwo-stage C C L D C T

2 Two-stage matrix decomposition

The N-point DCT is defined as follows:

output coefficients

at

where N is the transform length, n, k E (0, I ,

. .

., N - l},

and

i f n = O (1, otherwise

The corresponding inverse transformation is

N- I

x(k) =

1

c(n)y(n) cos

" = O (3)

For the purpose of simplicity, the transformation is rep- resented in matrix-vector form. The DCT transform matrix, with length N, is denoted as C,. An input column vector X = [xo, xlr

. . .

, xN-

11'

is transformed to

(2)

the output vector, denoted by Y, as

where D, = ( 2 / N ) diag (c(O), c(l),

..

.,

c(N - 1)).

in this paper is focussed on the transform matrix C,.

2.1 Index partitions

Consider the E T with power-of-two length only, that is N = 2", where m is a positive integer. Some definitions and lemmas for the index partition are given below.

Definition I: Z , stands for the set of integers (0, 1, .

. .

, 2" - l}, where m is a positive integer.

Lemma I: The integer set Z , is the disjoint union of the subsets A i :

wherej E Z,,-,-

,

and i E (0, 1, . . . , m - l}, and A, = {0}, i.e.

Y

= D,C,X (4)

For the purpose. of simplicity, the algorithm discussed

A i = {2'(2j

+

1))

Z" =

IJ

A i m

i = O

and A i n A j =

4,

for i # j .

Proof: See Appendix 8.1.

For example, when m = 4, the set Z4 can be decomposed

into A , = (1, 3, 5 , 7, 9, 11, 13, 15}, A , = ( 2 , 6, 10, 14}, A, = {4, 12}, A, = (8) and A, = (0).

Definition 2: Let g ( x ) = 2x

+

1, where x E Z,-

,

and g is

a mapping from Z, ~

,

to A , .

Since the transform kernels of DCT are cosine functions, they possess the same symmetric properties, such as cos ( n / 2 - 0) = -cos (n/2

+

0) for 0 E CO, n / 2 ] . Using the symmetric property, a functionf(g(k), n) can be defined to explore the symmetry of the transform matrix.

Definition 3:

( 5 )

where k, n, and f(k, n) E (0, 1, ..., N - 1, N}. The function abs (.) denotes absolute value and the operation mod,, is similar to that of modulo 2N except that the valid range is in { - N

+

1,. . ., - 1,0, 1,.

.

., N}.

f (k, n) = abs (kn mod,,)

The definition of f(g(k), n) is to formalise the index mapping of the DCT. We ignore the sign of the cosine functions in the transform kernels: only the absolute

values are considered. Hence, the argument k, n of functionf(g(k), n) denote the input and output indices, respectively, and the corresponding value is the angle to be computed in the transform. As indicated in definition 3, only the absolute values are involved. Thus, the sign signals are treated as additional control signals for the convolvers and are described in detail in the next Section. With the aid of the index mapping function defined in definition 3, we proceed to partition the index space to achieve higher order parallelism.

Lemma 2: For all n E A i and k E A , , the values of f(k, n) E A i .

Proof: See Appendix 8.2.

Lemma 2 explores the parallelism implied in the transform kernels of the DCT. That is, to compute any output value fin), only those angles in A i are required where n E A i . Since lemma 1 forms an index partition of the

index set Z, into m

+

1 subsets with maximum size N / 2 , the transform matrix C, can be decomposed into m

+

1 independent submatrices which can be computed totally in parallel.

2.2 Transform matrix decomposition

For simplicity, let

e,

= QCN where Q is a proper permutation matrix for output indices. In the following, sub- scripts are used to denote the corresponding ranks of the matrices. The matrix

e,

can be factorised as follows [14]:

e,

=

[

' " / 2 'N/2 'kO] RN/2 - R N / 2 'k/Z

where Iki2 denotes the opposite diagonal identity matrix. It is clear that the submatrix can be factorised recur- sively. The submatrix is left unfactorised as the maximum block of the postprocessing stage. After (log, N) - 1 iterations, the matrix is decomposed into

two stages. The preprocessing stage consists of data reversing and butterfly operations only, however the postprocessing stage is of block diagonal form. In eqn. 6, the second matrix on the right-hand side performs the

1

butterfly omrations often seen in FFT algorithms, and

the third matrix will reverse the lower half data.

for example,

e,,

becomes

For simplicity, we denote cos (8n/32) as C,. Therefore, -

CO CO CO CO CO CO CO CO CO CO CO CO CO CO CO CO

c, -c, -c,

c,

c, -c,

-e,

e,

C8

-e, -e,

c,

-c,

c,

c4

c,,

-c12

-c4

-c12

-c4

c4

c,,

c4

c12

-c,z

-c4

-c12

-c4

c4

c12

-c4

c4

- c l 2

-c4

c12 -cl2

c4

c12

-c4

c4

-c,z

-c4

c12 -c12

c4

c2 c6 - c 1 4 - c 1 0 - c 2 c 2 c6 cl, c l , - c 1 4 - c 1 0 - c 6 - c Z '

6 -'I4 -'IO '10 ' 2 '14 6' -'I4 -'IO '10 '2 '14 6'

-cl, -cl0

c6

- c 2 c Z - c 6 - c 1 4 - c 1 0 c6 c2 - c 6 cl, - c 1 4

'10 -c2 c6 - c 6 - c 1 4 - c 1 0 -c2 - c 6 - c 1 4 c2 - c 1 0

c ,

e,

c,

e,, e,, e,, -e,,

-c,,

-e,, -e,

-c,

-e,

e,

c,

c,,

-cl,

-e,

c, e,

c,, -c,, -c, -c,

c,

c,, -e,,

-c,

e, -e,

c,,

-e,

-cl)

c, -e,,

c,

-c,

c,,

-e,

e,, e, -e,,

e,

-c,

c,

c,,

-e7

c3 c,, -c,

c ,

c,,

-ell

-e,

c, -c,,

-e,

c7

-cl,

-cs

c,,

e,, c,

e,

c, c,

e, -e, -e, -e,

-c, -c,

-e,, -c,,

-cl,

c,

e ,

-c,

-ell c,,

c,

e,

-c,

-e, -e,, e,, e, -e, -e,

-c,,

-e,,

c,

-cl,

-e,

e,, -e,

e,

-c,

c,

-c,,

c, e,, -c3

c,,

-e,

c,

-c,

e,, c, -c,

e,, c,

-c, -e1,

e, -e3

-cl,

c, -c,

-e1,-

(7

(3)

Now, if the permutation matrix

Q

is chosen to be - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ~ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

From eqn. 6, a recursive algorithm for computing the preprocessing stage can be derived easily. Fig. 2 shows the signal flow graph of the resultant fast algorithm of the preprocessing stage of a 16-point example, Fig.

then the preprocessing stage of a 16-point DCT can be factorised as

From lemma 2, it is clear that the input indices g(k) E

the Output indices E and the

rotation a n g I e f ( d 4 n) E A , . Thus, according to the defi- nition of f ( g ( k ) , n), a binary operator can be defined as (9)

The corresponding postprocessing matrix is of block diagonal form as shown below:

-

c o o o

0 0 0 0 0 0

o c , o

0 0 0 0 0 0

o o c ,

c,,o

0 0 0 0 0 0

c,, -c,

0 0 0 0 0 0 0 0 0

c,

c,, c,,

0 0 0 0 0

c,

-c1,

-c2

-c,,

0 0 0 0 0

c,, -c,

c,, c,

0 0 0 0 0

c,,

-c1,

c,

-c,

0 0 0 0 0 0 0 0 0

c,

0 0 0 0 0 0 0 0

c3

0 0 0 0 0 0 0

o . c , -

0 0 0 0 0 0 0 0

c5

0 0 0 0 0 0 0 0

c,,

0 0 0 0 0 0 0 0

c,,

0 0 0 0 0 0 0 0

c 7 -

- 0 0 0 0 0 0 0 0

c,,

is, with appropriate input/output permutations, each block in the postprocessing stage can be computed uti- lising the circular-convolution hardwares [12, 131.

Since A i corresponds to a block in the postprocessing stage of size 1 A i

1,

only A , , the largest block, will be dis- cussed here. Following the same strategy, other blocks can be constructed without difficulty.

z (15) z.(q3) ;(?I) z(9) z(7) 2(5)z(3) z (1 z ( l L ) 2 (10) z(6)z(Z) z (IZ)z(L) z(8) z(0) Fig. 2 Preprocessing stage ofl6-point two-stage C C L D C T

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 c3 c5 c7 c9 c l , cl, c15 c9 c l , - cl, - c5 cl c7 ‘1.3 - c 5 c 1 3 - c l - cl, c3 - c l , c7 CIS

-c,

c3 c13 -c9 Cl C l l cl, c l , ‘9 ‘7 5‘ ‘3 ‘1 c7 ‘1 - 5‘ - ‘11 ‘15 ‘9 ‘3

-cll

c3 - c l ,

-cl

c13 - c S c9 cl -c9 cl, c 3 - c 7 cl, c 5 .

2.3 Circular-convolution-like pOStProcessing stage _{Definition 4 :}

_{B N}

_{is a binary operator defined over the set}

One way to achieve fast computations is to decompose further the submatrices in expr. 10 [8, 141. In this paper, we present another approach from the view point of architecture rather than the number of operations. That

IEE PROCEEDINGS, Vol. 137, Pt. F , No. 6 , DECEMBER 1990

A , , The definition of

B~

is

where mod,, is defined in definition 3.

(11) @ N n = abs ( k n

(4)

After defining the operator @ N , the relationship between input and output indices has been explored. The following lemma illustrates the so-called circular-convolution- like (CCL) properties.

CCL block whose length is the size Of the block. Note that in definition 4 only the absolute values of the cosine functions are formalised, hence sign factors are excluded

&point CCL matrix

since the DCT kernels are used in the matrix and the input signals in the column vector, then summing column-wise will lead to the semisystolic array imple-

4-point matrix

Fig. 3 Postprocessing s t q p 4 1 6 - p o i n t C C L DC7

Lemma 3: G , = { A o ; generator of the group. Proof: See Appendix 8.3.

From lemma 3, the generator 3 can generate the sequence

of the cyclic group as shown in Table 1 [l5, 161. It is clear that the circular-correlation matrix is of the same form as given in Table 1 . It is well known that time- reversion of circular-correlation becomes circular- convolution. Therefore, if the output indices are permuted according to the sequence generated by 3 and input indices are permuted as time reversal of the new output orders, each block in the postprocessing stage is a

Table 1 : Operation table of a cylic group

B e a b c

is a cyclic group, and 3 is a

e e a b c

a a b c e

b b c e a

c c e a b

e denotes identity and a denotes generator

diagonal matrix with maximum block size N/2. Each block is of CCL form. The reason why we call it CCL is that the matrix is of circular-convolution form except for some sign changes. The CCL form can be formulated as follows :

N - 1

An)

=

1

sg (k, n)x((n - k) mod N)h(k) (12) k = O

where

An)

denotes the output, and x(.) and h ( . ) denote the two input sequences for convolving. The function sg(.) controls the type of convolution. When sg (.) = 1, eqn. 12 represents a circular convolution, otherwise, if sg (k, n) = sign (n - k), eqn. 12 becomes a skew circular convolution, where

1, if x

>

0

- 1 , i f x < O sign (x) =

The function sg (.) for CCL is defined in eqn. 14. It can be treated as the control signals of the CCL architecture

[12, 131. Note that the magnitudes of the postprocessing

functions are defined by the operator O N and the corresponding signs are defined by sg (.).

(14)

sg (k, n) = sign (g(k)n

+

h’ mod,,)

As the transform kernels of the DCT are defined on the

4N equal divisions of the unit circle, the function sg (.) is defined according to the fundamental properties of the cosine function. The first and the fourth quadrants are positive, the others are negative.

4

Since the postprocessing stage is composed of rn

+

1 CCL submatrices, some simple architectures, such as semisystolic [12] and systolic arrays, and transversal filters [13] can be used to calculate the results. The relationship among various hardware structures and the corresponding matrix representations will be discussed in detail in this Section.

4.1 Semisystolic arrays for the

CCL

matrix

The operation of the maximal CCL block of the 16-point DCT can be expressed as a matrix vector product as

Realisation considerations for the CCL matrix

I

follows:

3

From the preceding section, it is clear that in the proposed two-stage DCT, the postprocessing stage is a block

Additional controls of the convolver clock pulse. Since the circular conv&tion operation is

commutative, interchanging the roles of input signals and DCT kernels will still result in the desired DCT coefficients.

(5)

In practical usage, the transform kernels are fixed and input signals are changed with time. Therefore, specific multipliers can be used to reduce the complexities of the implementation. Table look-up and power-of-two

o u t p u t approximation [ 131 are good alternatives.

c3

GI

c5

c,

b

c7 Cl, Cl c - - - - - - -0 0 0 0 0 0 0 0 0 y ~ 1 ) ctrl-1 drl-n ctrl-7 ctrl-I3 ctrl-I5 ctrl-5 drl-9 ctrl-3 -(1Vz(l

Fig. 4 Semisystolic structure for realising the C C L matrix

Therefore the 1-D systolic array for computing CCL can be easily derived as shown in Fig, 5 , From Fig, 5 , it is

promote the utilisation, one can interleave two input obvious that the hardware is only 50%. To

Input signal broadcasung is used

4.3 Transversal filter structures for the CCL matrix

wise instead of column-wise leads to the transversal filter From the ‘CL matrix shown in eqn. ‘’7 ‘Ow-

4 2 Systolic arrays for CCL matrix

Eqn. 15 can be reformulated as

- c 1 0 0 0 C l l Cl 0 0

c7

-cll -c,

0 c l , c 7

-ell

c l c l , c l , c7 c l , ‘5 ‘15 ‘13 -‘7 ‘ 9 -‘5 -‘lS ‘13

c,

c9

-c,

Cl, 0 c3

c9

c5 0 0 c3 -c9 0 0 0 c3 0 0 0 0 0 0 0 0 0 0 0 0 - 0 0 0 0 U Fig. 5 rirrulafion

Systolic array for computing the C C L matrix using input signal

0 0 0 0 Cl

c,

1 c7 c l , c l 5 CS c9 c3 0 0 0 0 0 0 - 0 0 0 0 0 0 0 0 0 0 0 0 Cl 0 0

-cll

-c,

0 c7 - c 1 1 Cl c 1 3 c7 ‘11 c l , ‘13 -‘7 - c 5 - c 1 5 Cl, c9

-c5

c,,

c3

c9

cs

0

c,

-c9

0 0 c3. Cl 0 0 0 0 0 0 c3 Cl 0 0 0 0 0

c9

c3

-c,

0 0 0 0

c,

-c9

c,

0 0 0 ‘15 ‘5

“3

‘3 ‘1 0 ‘13 ‘15 - c 5 ‘9 ‘3 c l ‘7 c l , - c 1 5 - c 5 c9 c 3 - c l c l , - c 7 c l , c l , cS - c 9 ‘ 3 0 ‘11 c7 c l , ‘15 c 5 c9 0 -‘ll ‘7 ‘13 c 1 5 - c S 0 0 -cll c7 c l , - c l S 0 0 0 0 c l , - c 7 c 1 3 0 0 0 0 0 C l l c7 0 0 0 0 0 0 - c 1 1 0 0 0 0 0 0 0

IEE PROCEEDINGS, Vol 137, Pt F , N o 6 , DECEMBER 1990

- ~ -

(6)

twice the number of clock periods. The semisystolic array suffers from signal broadcasting and the shortcoming of the transversal filter is the requirement for global summation. It is clear that for low-speed applications the systolic array is a good choice, yet for high-speed usages the semisystolic array and transversal filter structures are more attractive. 0 Z(l1) 0 Z(7) 0 Z(13)O z(15)O z(5) 0 z(3) 0 z(3) out 1 1 0 ~(11) 0 y(7) 0 ~ ( 1 3 ) 0 ~ ( 1 5 ) -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0- 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 -0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0- U Fig. 6 Jicient circulation

Systolic arrayfor computing the C C L matrix using output coef

y(3) y ( l ) Fig. 7

circulation

Table 2 : Comparison between semisyatolic array, systolic arrav and transversal filter

Transversalfilterfor computing C C L matrix using input signal

Semisystolic Systolic Transversal filter operation mode =rial-in serial-in parallel-in

no. of clocks N 2 N - 1 N

no. of adders N N N - 1

disadvantage broadcasting low-speed global parallel-out serial-out serial-out

additional summation latches

5 16-point DCT example

From the discussions above, a two-stage decomposition of the DCT transform matrix has been derived com- pletely. The preprocessing stage consists of data permutations and butterfly operations; the postprocessing stage comprises several CCL blocks which can be realised by modified convolvers in parallel. To clarify the operation of the algorithm, a specific 16-point DCT example will now be given.

5.1 Two -stage decomposition

The 16-point DCT transform matrix C,, can be factorised into preprocessing, input permutation, postprocessing and output permutation stages as

- 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ~ 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 (2

IEE PROCEEDINGS, Vol. 137, P t . F , No. 6, DECEMBER 1990 470

(7)

Table 3: Operation table of the 4-point CCL block showing Each block in the postprocessing stage is a circular- convolution-like (CCL) operation.

circular-convolution form

€ 3 1 5 7 3 The algorithm/architecture derived in this paper is

1 1 5 7 3 confined to the power-of-two length DCT. The upper

3 3 1 5 7 limit of the transform length is dependent upon current

7 7 3 1 5 VLSI technology. Since the implementation of the post-

5 5 7 3 1 _{processing stage is similar to that of a transversal filter,}

the upper limit on the transform length is about twice the uppe; bound on existing FIR filter skes.

additions/subtractions. The postprocessing stage can be implemented by using existing convolvers/correlators [12, 131. Since the coefficients of the CCL matrix are fixed, table lookup and mixed power-of-two approx-

imation are Flexibility is one of the

advantages of this proposed algorithm/architecture, that

the maximum block of the 2N-point DCT and one addi- From eqn. 18, it is obvious that the computation of _{The preprocessing}_{stage can be implemented by simple} the 16-point DCT can

be divided into four stages as

shown in Fig. 1. Hence? the for

computing the DCT becomes:

(a) compute the simp1e preprocessing stage ‘ p e . The fast computational algorithm is illustrated in Fig. 2 and described in eqn. 9.

eqn. 21

CCL hardwares described in the Drevious Section. The

(b) permute the as given in

(cl

compute the postprocessing stage, say ‘PS~.‘, using

Of ‘P~P based On

is, if an N-point DCT is available, one needs only to add tional reversingbutterfly stage to form a 2N-point DCT, block diagram of the stage is shown’ in Fig. 3

(d) reorder the output coefficients based on Q , which is given in expr. 8.

In the 16-point DCT example, there are 8-point, 4-point, 2-point and two 1-point blocks. Each block corresponds to one of the subsets described in lemma 1. It can be easily verified that the 4-point block of the 16-point DCT is the same as that of the 8-point DCT. Therefore, each CCL block can be constructed using the same strategy. Since 3 is used to generate the output order of the 8-point CCL block, that is (3’, 3l, 3’, 33, 34, 3’, 36, 3 7 ) u m d p r ~ , a r the result is (1, 3, 9, 5, 15, 13, 7, 11). In other words, the relative output order within the block can be renumbered using g-’ (3‘ under @ 1 6 ) and becomes (0, 1, 4, 2, 7, 6, 3, 5) in this example; the corresponding input order is (1, 11, 7, 13, 15, 5, 9, 3), the time-reversal of (1, 3, 9, 5, 15, 13, 7, 11). Thus, the relative input order within the 8-point block is (0, 5, 3,6, 7,2,4, l}, the time-reversal of (0, 1,4,2, 7, 6, 3, 5). The block diagram of the example is illustrated in Fig. 8.

absolute relative relotive absolute order for order for order f o r order for permutotion permutation CCL lnput CCL Input

Fig. 8

6 Conclusions

Input orderingfor ( 0 , 5 , 3 , 6 , 7 , 2 , 4 , I }

7 References

1 AHMED, N., NATARAJAN, T., and RAO, K.R.: ‘The discrete cosine transform’, IEEE Trans., January 1974, C-25, pp. 90-93 2 JAIN, A.K.: ‘Image data compression: a review’, Proc. IEEE, 1981,

pp. 349-389

3 ZELINSKI, R., and NOLL, P.: ‘Adaptive transform coding of speech signals’, IEEE Trans., August 1977, ASP-25, pp. 299-309 4 HAMIDI, M., and PEARL, J.: ‘Comparison of the cosine and

Fourier transforms of Markov-I signals’, IEEE Trans., October 1976, ASSP-24, pp. 428429

5 CLARK, R.J.: ‘Relation between the Karhunen-Loeve and cosine

transform’, IEE Proc. F, Commun., Radar & Signal Process.. Novem- ber 1981,128, pp. 359-360

6 MAKHOUL, J.: ‘A fast cosine transform in one and two dimen- sions’, IEEE Trans., February 1980, ASSP-28, pp. 27-34

7 LEE, B.G.: ‘A new algorithm to compute the discrete wsine trans- form’, IEEE Trans., December 1984, ASP-32, (6), pp. 1243-1247 8 HOU, H.S.: ‘A fast recursive algorithm for computing the discrete

cosine transform’, IEEE Trans., October 1987, ASSP-35, pp. 1 4 5 s 1461

9 SUN, M.T., WU, L., and LIOU, M.L.: ‘A concurrent architecture for VLSl implementation of discrete cosine transform’, IEEE Trans., August 1987, CAS-34, (8). pp. 992-994

10 WU, I.-L., and DUH, W.-J.: ‘A novel concurrent architecture to implement discrete cosine transform based on index partitions’, to appear in Int. J. Electron.

11 DUH, W.-I., and WU, J.-L.: ‘A constant-rotation DCT architecture based on CORDIC techniaues’. lo avvear in In[. J. Electron.

..

12 ERSOY. 0.: ‘Semisystolic‘ array implementation of circular, skew

circular, and linear convolutions’, IEEE Trans., February 1985, C-34, (2), pp. 1-194

13 BOUSSAKTA, S., and HOLT, A.G.J.: ‘Prime-factor Hartley and Hartley-like transform calculation using transversal filter-type struc- tures’, IEE Proc. G., Circuits, Deuices & Syst., October 1989, 136, (5). pp. 269-277

14 CHEN, W.-H., HARRISON SMITH, C., and FRALICK, S.C.: ‘A fast computational algorithm for the discrete cosine transform’.

IEEE Trans., September 1977, COM-25, (9), pp. 1004-1009 15 GILBERT, J., and GILBERT, L.: ‘Elements of modern algebra’

(PWS Publishers, Boston, USA, 1984)

16 SHANKS, D.: ’Solved and unsolved problems in number theory’ (Chelsea Publishing Co., NY, 1978)

8 Appendix

8.1 Proof of lemma 1

we

complete the proof of L~~~~ 1 by using ~ ~

4

l ~ ~ * ~ function,

A novel, fast, parallel algorithm/architecture for computing the DCT has been proposed. The derivation of the fast algorithm is based on the decomposition of the DCT transform matrix into pre- and postprocessing stages. The preprocessing stage consists of

+

1, - 1, and 0 and it (including (log, N) - 1 half data-reversing stages and d‘N log, N stages for butterflies). T h e postprocessing stage is a block-diagonal matrix with maximum block size N/2.

IEE PROCEEDINGS, Vol. 137, Pt. F , N o . 6 , DECEMBER 1990

Theorem 1

can be factorised into (2 log, N) - 1 successive stages

1

&d) = N (22) where N is a positive integer and A N denotes ‘d divides N’.

(8)

The index set of an N-point transformation is (0, 1, 2, . . . , N - 1). If the transform length is power-of-two, denoted

as 2", the index set can be partitioned into m

+

1 subsets, corresponding to

4(d),

where d E {2', 2l, 2',

..

., 2"},

respectively.

Since n E A i and k E A , , it is obvious that gcd (n, N) =

?

for i E (0, 1, . . . , m - 1). Therefore, gcd ( f ( k , n), N) = 2', that is, the valuef(k, n) is still in the set

A i .

First, we prove that G, = { A o ; @,} is a group. The binary operator

8,

is similar to modulo multiplication. If the number is first normalised to the set, say { - N

+

1, - N

+

2,

. . .

, - 1, 0, 1, . . . , N - 1, N} mod 2N, then the

operation of @, is the absolute value of corresponding modulo multiplications. Since 'modulo multiplication group' is a group, it is obvious that the four properties of a group structure are satisfied in G,.

Next, we prove that 3 is a generator of group G, and 3,1Z

=

1 (under 8,)

3N/4

=

N - 1 (under @,) (23)

Eqn. 23 means that 3 is a generator of order N/2. The reason for using N - 1 in the second equation is that (N - 1)BAN - 1) = 1. The proof of eqn. 23 is as follows:

(a) From Euler's function, we have 34"2N'

=

l(mod 2N)

3, l(mod 2N) (b) Since N = 2", we have 3"' - l(mod 2N)

=

l(under 8,)

Therefore, the first of eqns. 23 is verified.

m

>

4 the proof is as follows:

(c) If N = 2", for m < 4 the proof is trivial and for

3 ~ 4 = (2

+

1 y 4 =

y

(y)2*

k = O +2"-1+(22"-3-2m-1)+0 + f(24m-7 - 12 23"-6

+

22 22-4 - 12 2-2)

+

o

+

...

+

q m o d 2""') 1 - 4 x 2"-'(mod 2"'")

=

1 - 2"(mod 2"+')

=

N - l(under @,) (24)

Thus, G, is a cyclic group and 3 is a generator.