A NEW ARRAY ARCHITECTURE FOR PRIME-LENGTH DISCRETE COSINE TRANSFORM

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 41, NO. 1, JANUARY 1993

Jiun-In Guo, Chi-Min Liu, and Chein-Wei Jen

Abstract-A new approach to derive a systolic algorithm for the prime-length discrete cosine transform (DCT) is proposed. It makes use of the input/output (I/O) data permutations and the symmetry property of the cosine kernels such that the proposed array possesses outstanding performance in the hardware cost of the processing elements (PE's), the average computation time, and the I/O cost.


I. INTRODUCTION

The discrete cosine transform (DCT) has been widely used in image coding for its near-optimal performance [1]. Since the DCT is computation intensive, the development of high-speed hardware is necessary in many real-time applications. Systolic arrays are an appropriate architecture to meet the requirements of both high processing speed and VLSI implementation. However, the computing algorithms encapsulated within systolic arrays need to be developed specifically.

Recently, several systolic array architectures [2]-[6] have been proposed to realize the one-dimensional DCT. These architectures can be categorized into linear array architectures [2]-[4] and two-dimensional array architectures [5], [6].

Although the two-dimensional arrays can attain higher speeds than one-dimensional arrays, the hardware complexity of the PE's and the control complexity of these two-dimensional arrays are generally higher than those of linear arrays. Furthermore, the two-dimensional arrays need high I/O bandwidth and a large number of I/O channels to attain the higher speeds, unless most operands are preloaded into the arrays instead of being supplied from the input ports. But additional overhead is incurred if the operands are preloaded into the arrays, as in the two-dimensional array in [5]. Consider, for example, the array in [6]: the average computation time for the N-point DCT is (A + 2) cycles, while the number of multipliers in the array is (4N + 4A), if the clock cycle is assumed to be the time consumed by one multiplication. In addition, undesirable features such as complex control, high I/O bandwidth, and a large number of I/O channels still accompany the array in [6]. The attractive feature of linear arrays is that the I/O bandwidth and the number of I/O channels can be kept independent of the DCT length if the I/O channels exist only at the two extreme ends of a linear array. As discussed in [8], the high I/O bandwidth required for most systolic arrays would limit computing speeds. Hence, linear arrays should be one feasible architecture for a system application. However, keeping the I/O channels at the extreme ends of linear arrays while pursuing high computing power is a challenging design issue when deriving systolic algorithms for linear arrays.

The approach in [2] is to directly represent the DCT as a matrix-vector multiplication first. Then, the systolic array realization for the matrix-vector multiplication can be directly modified to compute the DCT. Since the array designed in [2] cannot retain the I/O channels at its two extreme ends, a large number of I/O channels and high I/O bandwidth are needed. Another approach [3] modifies the DCT into a form similar to the discrete Fourier transform (DFT) and realizes the DCT by using the array that has been developed for the DFT. Since the twiddle factor exp(j2π/N) in the DFT is a complex number while the factor cos(2π/4N) in the DCT is a real number, arrays designed with this approach incur much hardware cost. In addition, the approach in [4] also represents the DCT as a matrix-vector multiplication like [2], but it generates the transform kernels recursively in the array instead of prestoring them in memory. The array in [4] uses this method to reduce the I/O cost, such as the number of I/O channels and the I/O bandwidth, but additional hardware cost is paid for the recursive generation of the cosine kernels.

To simultaneously consider the hardware cost, the I/O bandwidth, and the number of I/O channels, a systolic algorithm for the prime-length DCT is derived in this correspondence. The design approach utilizes the input and output data permutations together with the symmetry property of the cosine kernels, such that the proposed array can retain most I/O channels at the two extreme ends and simultaneously attain good performance in average computation time, hardware cost of the PE's, and the number of PE's. The performance of the proposed array and that of the linear arrays in [2]-[4] are discussed in Section III. From Section III, we can see that the proposed array possesses better performance than the arrays in [2], [3] in the hardware cost of the PE's, the average computation time, the number of I/O channels, and the I/O bandwidth. Moreover, it also possesses better performance than the array in [4] in the hardware cost of the PE's. The overheads of the proposed array include some additional shift registers, latches, multiplexers, a demultiplexer, and a switching element for solving control problems. Basically, these overheads are minor compared with the savings in the hardware cost of the PE's in the array. This correspondence is arranged as follows. Section II describes the derivation of the computing algorithm encapsulated in the array. Section III considers the array realization of the proposed systolic algorithm. A brief conclusion is given in Section IV.

Manuscript received January 18, 1991; revised November 4, 1991. Part of this correspondence was presented at the IEEE Workshop on Visual Signal Processing and Communications, June 6-7, 1991. This work was supported by the National Science Council under Grant NSC80-0404-E009-39.

J.-I. Guo is with the Department of Electronics Engineering, Institute of Electronics, National Chiao Tung University, Hsinchu, 30039, Taiwan, Republic of China.

C.-M. Liu is with the Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, 30039, Taiwan, Republic of China.

C.-W. Jen is with the Department of Electronics Engineering, Institute of Electronics, National Chiao Tung University, Hsinchu, 30039, Taiwan, Republic of China.

IEEE Log Number 9203370.

1053-587X/93$03.00 © 1993 IEEE

II. THE ALGORITHM DERIVATION

The DCT is defined as

$$Y(k) = \sum_{i=0}^{N-1} y(i) \cos\left[\frac{(2i+1)k\pi}{2N}\right], \quad \text{for } k = 0, 1, \ldots, N-1 \tag{1}$$

where $\{y(i) \mid i = 0, 1, \ldots, N-1\}$ is the input sequence and $\{Y(k) \mid k = 0, 1, \ldots, N-1\}$ is the output sequence. We represent (1) as a matrix-vector multiplication as follows:

$$
\begin{bmatrix} Y(0) \\ Y(1) \\ Y(2) \\ Y(3) \\ Y(4) \\ Y(5) \\ Y(6) \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 \\
\cos(a) & \cos(3a) & \cos(5a) & \cos(7a) & \cos(9a) & \cos(11a) & \cos(13a) \\
\cos(2a) & \cos(6a) & \cos(10a) & \cos(14a) & \cos(10a) & \cos(6a) & \cos(2a) \\
\cos(3a) & \cos(9a) & \cos(13a) & \cos(7a) & \cos(a) & \cos(5a) & \cos(11a) \\
\cos(4a) & \cos(12a) & \cos(8a) & 1 & \cos(8a) & \cos(12a) & \cos(4a) \\
\cos(5a) & \cos(13a) & \cos(3a) & \cos(7a) & \cos(11a) & \cos(a) & \cos(9a) \\
\cos(6a) & \cos(10a) & \cos(2a) & \cos(14a) & \cos(2a) & \cos(10a) & \cos(6a)
\end{bmatrix}
\begin{bmatrix} y(0) \\ y(1) \\ y(2) \\ y(3) \\ y(4) \\ y(5) \\ y(6) \end{bmatrix}
\tag{2}
$$

where "a" denotes π/14 and N is assumed to be 7. If (2) is directly realized by linear array architectures, as was done in [2], one input port would be needed in every PE to transmit the cosine kernels for proper operation, which would induce a large number of I/O channels and high I/O bandwidth. It can be shown that the DCT defined in (1) can be formulated as

$$Y(k) = \{2T(k) + x(0)\} \cos\left[\frac{k\pi}{2N}\right], \quad \text{for } k = 0, 1, \ldots, N-1 \tag{3}$$

where

$$T(k) = \sum_{i=1}^{N-1} x(i) \cos\left[\frac{ik\pi}{N}\right] \tag{4}$$

and x(i) is another sequence defined as

$$x(N-1) = y(N-1), \qquad x(i) = y(i) - x(i+1) \quad \text{for } i = 0, 1, \ldots, N-2. \tag{5}$$

If N is a prime number, there exists some number "g," not necessarily unique, such that there is a one-to-one mapping from the integers $\{i \mid i = 1, 2, \ldots, N-1\}$ to the integers $\{j \mid j = 1, 2, \ldots, N-1\}$, given by

$$j = |g^i|_N \tag{6}$$

where $|A|_N$ denotes the result of the "A-modulo-N" operation. Then (4) can be reformulated with i and k as powers of the primitive element "g." Because i and k take on the value zero, and zero is not a power of "g," the zero-frequency component must be treated specially, i.e.,

$$Y(0) = \sum_{i=0}^{N-1} y(i) \tag{7}$$

$$Y(k) = \{2T(k) + x(0)\} \cos\left[\frac{k\pi}{2N}\right], \quad \text{for } k = 1, \ldots, N-1 \tag{8a}$$

where

$$T(k) = \sum_{i=1}^{N-1} x(i) \cos\left[\frac{ik\pi}{N}\right], \quad \text{for } k = 1, \ldots, N-1. \tag{8b}$$
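The reformulation in (5), (7), (8a), and (8b) can be checked numerically against the definition (1). The following is a minimal Python sketch for illustration only; the function names `dct_direct`, `x_from_y`, and `dct_via_T` are hypothetical, and floating-point arithmetic is assumed in place of any fixed-point hardware format.

```python
import math

def dct_direct(y):
    """Unscaled DCT of (1): Y(k) = sum_i y(i) cos((2i+1) k pi / (2N))."""
    n = len(y)
    return [sum(y[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
            for k in range(n)]

def x_from_y(y):
    """Intermediate sequence of (5): x(N-1) = y(N-1), x(i) = y(i) - x(i+1)."""
    n = len(y)
    x = [0.0] * n
    x[n - 1] = float(y[n - 1])
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - x[i + 1]
    return x

def dct_via_T(y):
    """DCT through (7), (8a), (8b): one cosine factor outside the sum."""
    n = len(y)
    x = x_from_y(y)
    out = [float(sum(y))]                                      # (7)
    for k in range(1, n):
        t_k = sum(x[i] * math.cos(i * k * math.pi / n)         # (8b)
                  for i in range(1, n))
        out.append((2 * t_k + x[0]) * math.cos(k * math.pi / (2 * n)))  # (8a)
    return out

y = [3, 1, 4, 1, 5, 9, 2]            # arbitrary 7-point test input
direct = dct_direct(y)
via_t = dct_via_T(y)
max_err = max(abs(a - b) for a, b in zip(direct, via_t))
```

Here `max_err` measures the largest disagreement between the two routes; it should be at rounding-error level, since (8a) only factors the common term cos[kπ/2N] out of the sum in (1).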

Applying (6) to (8b), it follows that

$$T(|g^k|_N) = \sum_{i=1}^{N-1} x(|g^i|_N) \cos\left[\frac{|g^i|_N \times |g^k|_N \, \pi}{N}\right], \quad k = 1, 2, \ldots, N-1. \tag{9a}$$

The term "$|g^i|_N \times |g^k|_N$" can be expressed as

$$|g^i|_N \times |g^k|_N = |g^{i+k}|_N + m \times N, \quad i, k = 1, 2, \ldots, N-1 \tag{9b}$$

where "m" is an integer. Then, (9a) can be written as

$$T'(k) = T(|g^k|_N) = \sum_{i=1}^{N-1} x'(i) \times C_k^i, \quad k = 1, 2, \ldots, N-1 \tag{9c}$$

where

$$C_k^i = (-1)^m \cos\left[\frac{|g^{i+k}|_N \, \pi}{N}\right]$$

and

$$x'(i) = x(|g^i|_N).$$

Now (7), (8a), and (9c) constitute the computational equations for the DCT. To see the difference between these computational equations and (1), (9c) is written as

$$
\begin{bmatrix} T'(1) \\ T'(2) \\ T'(3) \\ T'(4) \\ T'(5) \\ T'(6) \end{bmatrix}
=
\begin{bmatrix}
-\cos(2a) & \cos(6a) & \cos(4a) & -\cos(5a) & \cos(a) & \cos(3a) \\
\cos(6a) & \cos(4a) & -\cos(5a) & -\cos(a) & -\cos(3a) & \cos(2a) \\
\cos(4a) & -\cos(5a) & -\cos(a) & -\cos(3a) & \cos(2a) & \cos(6a) \\
-\cos(5a) & -\cos(a) & -\cos(3a) & \cos(2a) & \cos(6a) & \cos(4a) \\
\cos(a) & -\cos(3a) & \cos(2a) & \cos(6a) & -\cos(4a) & \cos(5a) \\
\cos(3a) & \cos(2a) & \cos(6a) & \cos(4a) & \cos(5a) & \cos(a)
\end{bmatrix}
\begin{bmatrix} x'(1) \\ x'(2) \\ x'(3) \\ x'(4) \\ x'(5) \\ x'(6) \end{bmatrix}
\tag{10}
$$

where "a" denotes π/7, and N and "g" are assumed to be 7 and 3, respectively. It can be seen that the absolute values of the cosine kernels along the same antidiagonal of the matrix in (10) are equal, while those in the matrix of (2) do not have any such order. This means that the vector of T'(k) is the circular convolution of the inputs x'(i) and the cosine kernels. The same structure exists in the DFT; it was first found by Rader [10] and has also been used to design efficient systolic arrays for the prime-length DFT [9]. Now we apply it to derive the systolic algorithm for the DCT. From the viewpoint of array realization, a constant value along an antidiagonal means that this value can be sent to every PE along a link from one input port at the extreme end of a linear array. The (2N - 3) antidiagonal lines in the matrix of (10) mean that only (2N - 3) values, instead of the N² values in the matrix of (2), need to be sent to the array. This property can be exploited to design a systolic array with a low number of I/O channels and low I/O bandwidth.

From (10), since cos(kπ/N) = -cos((N - k)π/N), it is observed that the absolute values of the cosine kernels located in the left three columns are the same as those located in the right three columns. This symmetry property allows a further reduction of the computational complexity of the algorithm. As shown in the Appendix, the symmetry property of the cosine kernels can be expressed as the following equation:

$$\cos\left[\frac{|g^{j+k}|_N \, \pi}{N}\right] = -\cos\left[\frac{|g^{j+k+(N-1)/2}|_N \, \pi}{N}\right], \quad j = 1, \ldots, \frac{N-1}{2}, \text{ and } k = 1, \ldots, N-1. \tag{11}$$

Applying (11) to (9c), (9c) can be written as

$$T'(k) = \sum_{i=1}^{(N-1)/2} x''(i) \times C_k^i, \quad k = 1, 2, \ldots, N-1 \tag{12a}$$

where

$$
x''(i) =
\begin{cases}
x'(i) + x'\!\left(i + \frac{N-1}{2}\right), & \text{if } m_1 \text{ and } m_2 \text{ are one even number and one odd number} \\
x'(i) - x'\!\left(i + \frac{N-1}{2}\right), & \text{if } m_1 \text{ and } m_2 \text{ are all even numbers or all odd numbers.}
\end{cases}
$$

The integers $m_1$ and $m_2$ are determined by the following equations:

$$|g^i|_N \times |g^k|_N = |g^{i+k}|_N + m_1 \times N$$

and

$$|g^{i+(N-1)/2}|_N \times |g^k|_N = |g^{i+k+(N-1)/2}|_N + m_2 \times N, \quad i = 1, \ldots, \frac{N-1}{2}, \quad k = 1, 2, \ldots, N-1,$$

where

$$|g^{i+k}|_N + |g^{i+k+(N-1)/2}|_N = N.$$

Now (7), (8a), and (12) constitute the computational equations of the DCT in the proposed algorithm. Considering the computational complexity, the number of multiplications has been reduced from (N - 1)² in (9c) to (N - 1)²/2 in (12a). In addition, the vector of T'(k) in (12a) is still in a circular convolution form. It will be shown in the next section that such a form is beneficial to the reduction of the I/O cost.
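The index mapping (6), the antidiagonal structure of the permuted kernel matrix, and the folding step that halves the multiplications can be sketched in Python. This is an illustrative check under stated assumptions rather than the authors' implementation: the helper names `index_map`, `kernel`, `t_prime_full`, and `t_prime_folded` are hypothetical, and the test vector is arbitrary.

```python
import math

def index_map(g, n):
    """Return [|g^1|_N, ..., |g^(N-1)|_N], a permutation of 1..N-1 for primitive g."""
    return [pow(g, i, n) for i in range(1, n)]

def kernel(i, k, g, n):
    """Entry (k, i) of the permuted matrix: cos(|g^i|_N * |g^k|_N * pi / N)."""
    return math.cos(pow(g, i, n) * pow(g, k, n) * math.pi / n)

def t_prime_full(xp, g, n):
    """Full sum over i = 1..N-1: (N-1)^2 multiplications. xp[0] is unused."""
    return [sum(xp[i] * kernel(i, k, g, n) for i in range(1, n))
            for k in range(1, n)]

def t_prime_folded(xp, g, n):
    """Folded sum over i = 1..(N-1)/2 using the kernel symmetry."""
    h = (n - 1) // 2
    out = []
    for k in range(1, n):
        gk = pow(g, k, n)
        acc = 0.0
        for i in range(1, h + 1):
            m1 = (pow(g, i, n) * gk) // n       # |g^i||g^k| = |g^(i+k)| + m1*N
            m2 = (pow(g, i + h, n) * gk) // n   # same relation at i + (N-1)/2
            c = (-1) ** m1 * math.cos(pow(g, i + k, n) * math.pi / n)
            # plus when m1 and m2 differ in parity, minus otherwise
            x2 = xp[i] + xp[i + h] if (m1 + m2) % 2 else xp[i] - xp[i + h]
            acc += x2 * c
        out.append(acc)
    return out

N, G = 7, 3
perm = index_map(G, N)
x = [0.0, 2.0, -1.0, 0.5, 4.0, 3.0, -2.5]    # arbitrary x(0..6)
xp = [0.0] + [x[j] for j in perm]            # x'(i) = x(|g^i|_N)
full = t_prime_full(xp, G, N)
folded = t_prime_folded(xp, G, N)
fold_err = max(abs(a - b) for a, b in zip(full, folded))

# |kernel| depends only on i + k, so it is constant along each antidiagonal
antidiag_ok = all(
    abs(abs(kernel(i, k, G, N)) - abs(kernel(i + 1, k - 1, G, N))) < 1e-12
    for i in range(1, N - 1) for k in range(2, N))
```

The folded version performs (N - 1)/2 multiplications per output instead of N - 1, at the cost of one extra addition or subtraction per product, which is the trade the symmetry property makes possible.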

Fig. 1(a). The dependence graph (DG) of the proposed algorithm for 7-point DCT where "a" denotes π/7.

III. THE ARRAY REALIZATION

This section considers the array realization of the proposed systolic algorithm. Fig. 1 shows the dependence graph (DG) [12] of the proposed algorithm for a seven-point DCT. The DG clearly shows the data operations, data dependencies, and control signals involved in the proposed algorithm. Linear arrays can be constructed from the DG according to the design procedure in [12], and the tag control scheme [13] can be utilized for the I/O control and data control. Based on these two design approaches, Fig. 2 shows the constructed array for the seven-point DCT with projection vector [0 1]. To show the activity of the array clearly, we rewrite (7), (8a), and (12a) in recursive forms as

$$z^0 = x(0), \qquad z^i = z^{i-1} + 2 \times [x'(i) + x'(i+3)], \quad i = 1, 2, 3, \qquad Y(0) = z^3 \tag{13a}$$

$$Y'(k) = \{2T'(k) + x(0)\} \times \cos\left[\frac{|3^k|_7 \, \pi}{14}\right], \quad k = 1, \ldots, 6 \tag{13b}$$

$$y_k^0 = 0, \qquad y_k^i = y_k^{i-1} + x''(i) \times C_k^i, \quad i = 1, 2, 3, \quad k = 1, \ldots, 6, \qquad T'(k) = y_k^3 \tag{13c}$$

where $Y'(k) = Y(|3^k|_7)$, and "$y_k^i$" and "$z^i$" are the intermediate results.
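The recursions (13a)-(13c) can be followed step by step in software. The sketch below is a hedged behavioral model, not a description of the actual PE hardware: the function names `dct_direct` and `dct_recursive` are hypothetical, the half-length `h` generalizes the constant 3 used for N = 7, and each inner-loop iteration stands for one accumulate step of the recursion.

```python
import math

def dct_direct(y):
    """Reference: unscaled DCT of (1)."""
    n = len(y)
    return [sum(y[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
            for k in range(n)]

def dct_recursive(y, g):
    """Prime-length DCT via the recursions (13a)-(13c); g is a primitive element."""
    n = len(y)
    h = (n - 1) // 2
    # Sequence x(i) of (5) and the input permutation x'(i) = x(|g^i|_N)
    x = [0.0] * n
    x[n - 1] = float(y[n - 1])
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - x[i + 1]
    xp = [0.0] + [x[pow(g, i, n)] for i in range(1, n)]

    # (13a): z starts at x(0) and accumulates 2 * [x'(i) + x'(i + h)]
    z = x[0]
    for i in range(1, h + 1):
        z = z + 2 * (xp[i] + xp[i + h])
    y_out = [0.0] * n
    y_out[0] = z

    for k in range(1, n):
        gk = pow(g, k, n)
        acc = 0.0                                    # y_k^0 = 0
        for i in range(1, h + 1):                    # (13c)
            m1 = (pow(g, i, n) * gk) // n
            m2 = (pow(g, i + h, n) * gk) // n
            c = (-1) ** m1 * math.cos(pow(g, i + k, n) * math.pi / n)
            x2 = xp[i] + xp[i + h] if (m1 + m2) % 2 else xp[i] - xp[i + h]
            acc = acc + x2 * c
        # (13b), stored at the permuted output index since Y'(k) = Y(|g^k|_N)
        y_out[gk] = (2 * acc + x[0]) * math.cos(gk * math.pi / (2 * n))
    return y_out

y = [3, 1, 4, 1, 5, 9, 2]
err = max(abs(a - b) for a, b in zip(dct_direct(y), dct_recursive(y, 3)))
```

For N = 7 and g = 3 the inner loop runs three times per output, matching the index range i = 1, 2, 3 in (13c).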

Fig. 1(b). The functions of nodes.

Fig. 2. (a) The array architecture for 7-point DCT where "c( )" denotes cos( ) and "a" denotes π/7. (b) The functions of the PE's in the array. (c) The preprocessing stage in the array where SR denotes shift register, SE denotes switching element.

Fig. 2. (Continued) (d) The postprocessing stage in the array where SR denotes shift register and L denotes latch.

From Fig. 2(a), we know that the operations specified in (13a) and (13b) are computed within the left-most PE, while those in (13c) are computed in the other PE's. Multiplication and addition constitute the main functions of the PE's, which are shown in Fig. 2(b). Three control signals, denoted "Tag1," "Tag2," and "sign," are used to select the right operands in the operations. Fig. 2(c) shows the preprocessing stage needed in the array. The intermediate sequence x(i) can be generated from the input sequence y(i) by a subtractor; multiplexers and a switching element then permute the sequence x(i), with the required control signals generated by circular shift registers. Finally, the required data patterns are obtained by adding and subtracting the permuted data. Fig. 2(d) shows the postprocessing stage in the array, which uses a demultiplexer to perform the output data permutation. Similarly, the control signals needed by the demultiplexer can be generated by a circular shift register. The shift registers and latches in Fig. 2(c) and Fig. 2(d) allow the array to be pipelined. That is, the intermediate signals x(i) and the output results Y(k) of the current block are shifted into the shift registers serially. After all of these (N - 1) values have been shifted into the registers, they are shifted in parallel into the latches for the I/O data permutations, so that the data of the next block can be shifted into the registers continuously without any time delay. Therefore, the proposed array, including the preprocessing and postprocessing stages, can be fully pipelined, and a high throughput rate can be attained.
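The shift-register-plus-latch arrangement described above behaves like a double buffer: one block is held stable for permuted read-out while the next block streams in. The following Python sketch models only that behavior for the output permutation; the class name `PermutingBuffer` and its methods are hypothetical, and the model ignores clocking, multiplexer control, and the circular shift registers that generate the control signals.

```python
class PermutingBuffer:
    """Serial-in shift register followed by a parallel-load latch.

    order[j] is the latch position read out at output slot j.
    """

    def __init__(self, order):
        self.order = list(order)
        self.shift_reg = []
        self.latch = None

    def push(self, value):
        """Shift one value in; once a block is complete, load the latch in parallel."""
        self.shift_reg.append(value)
        if len(self.shift_reg) == len(self.order):
            self.latch = self.shift_reg
            self.shift_reg = []

    def read(self):
        """Return the latched block in permuted order (None before the first block)."""
        if self.latch is None:
            return None
        return [self.latch[p] for p in self.order]

N, G = 7, 3
# Y'(k) = Y(|g^k|_N): for each j = 1..N-1 find the k with |g^k|_N = j
k_of_j = {pow(G, k, N): k for k in range(1, N)}
order = [k_of_j[j] - 1 for j in range(1, N)]     # 0-based latch positions

buf = PermutingBuffer(order)
for k in range(1, N):                            # block 1 arrives as Y'(1..6)
    buf.push("Y'(%d)" % k)
for v in ["next1", "next2", "next3"]:            # next block starts streaming in
    buf.push(v)
natural = buf.read()                             # block 1 as Y(1), ..., Y(6)
```

Because the latch is only overwritten when a whole block has arrived, the partially shifted next block does not disturb the read-out of the current one, which is the property that lets the stages run without stalls.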

In order to see the features of the proposed array more clearly, (12a) is expressed as

$$
\begin{bmatrix} T'(1) \\ T'(2) \\ T'(3) \\ T'(4) \\ T'(5) \\ T'(6) \end{bmatrix}
=
\begin{bmatrix}
-\cos(2a) & \cos(6a) & \cos(4a) \\
\cos(6a) & \cos(4a) & -\cos(5a) \\
\cos(4a) & -\cos(5a) & -\cos(a) \\
-\cos(5a) & -\cos(a) & -\cos(3a) \\
\cos(a) & -\cos(3a) & \cos(2a) \\
\cos(3a) & \cos(2a) & \cos(6a)
\end{bmatrix}
\begin{bmatrix} x'(1) \pm x'(4) \\ x'(2) \pm x'(5) \\ x'(3) \pm x'(6) \end{bmatrix}
\tag{14}
$$

where "a" denotes π/7, and N and "g" are assumed to be 7 and 3, respectively. If "k" is equal to 1, 5, or 6, the minus signs in the input vector are valid. Otherwise, the plus signs are valid. As shown in (14), there are (3N - 5)/2 values that need to be sent to the array for computing the N-point DCT, and C = {cos(2a), cos(6a), cos(4a), cos(5a), cos(a), cos(3a), cos(2a), cos(6a)} is the sequence of these eight values for the seven-point DCT. It is observed that the last two cosine kernels are identical to the first two cosine kernels in C, so these common cosine kernels can be shared when computing two neighboring blocks successively. When many image blocks are processed continuously, it is only necessary to send six values instead of eight to the array for computing each seven-point DCT. It can be seen from the array in Fig. 2(a) that only (N - 1) cosine kernels are needed to compute an N-point DCT, and the average computation time for computing the N-point DCT is (N - 1) cycles. This phenomenon is induced from the cyclic property of the modulo operation in (6), i.e.,

$$|g^i|_N = |g^{N-1+i}|_N.$$

Exploiting the specific order of the cosine kernels in the matrix of (14), these kernels are imported from the right-most PE instead of being imported from every PE as in the approach of [2]. Therefore, the proposed array requires a low number of I/O channels and low I/O bandwidth. Considering the I/O cost, the I/O costs of the designs [2]-[4] are proportional to (N + 2)L [2], (N + 3)L [3], and 8L [4], where L is the wordlength, while the I/O cost of the proposed array is only proportional to 7L + N + 2. Also, the proposed array needs much lower hardware cost than the designs [2]-[4]. The required numbers of multipliers are N [2], 4N + 4 [3], and 2N - 2 [4], which are much larger than the (N + 1)/2 of the proposed array. Moreover, regarding the average computation time, the proposed array needs (N - 1) cycles for computing the N-point DCT, which is better than the N cycles in [2] and the (N + 1) cycles in [3]. The hardware overheads of the proposed array include some shift registers, latches, multiplexers, a demultiplexer, and a switching element for solving the control problems and the I/O data permutations. In addition, the cycle time of the array includes the multiplication and addition time as well as the time for multiplexing. However, these overheads are minor compared with the savings in hardware cost in the proposed array. As a whole, the proposed array excels the arrays in [2], [3] in average computation time, hardware cost of the PE's, the number of I/O channels, and the I/O bandwidth. It also excels the array in [4] in the hardware cost of the PE's.

IV. CONCLUSIONS

In this correspondence, a new approach to derive the systolic algorithm for the prime-length DCT is presented. This approach gives the array good performance in hardware cost of the PE's, average computation time, the number of I/O channels, and the I/O bandwidth. Also, this design approach can be similarly applied to derive systolic algorithms for the discrete sine transform (DST) and the discrete Fourier transform (DFT) [9]. Although the proposed systolic algorithm and array are derived under the restriction that N is a prime number, they can be applied to a nonprime-length DCT by extending the input data from the nonprime length to a prime length, at the expense of some overhead in hardware cost and average computation time. With these overheads, the hardware cost of the proposed array is still lower than that of the arrays in [2]-[4]. However, it is not always a drawback that N is a prime number. It is known that the blocking effect will occur when the DCT is applied to image coding at low bit rates, and the overlapping method is one of the remedies for this problem [11]. Applying the proposed algorithm to the nonprime-length DCT by using the overlapping method can also reduce the undesirable blocking effect.

APPENDIX

In the Appendix, the proof of (11) is given. At first, (11) is written here as

$$\cos\left[\frac{|g^i|_N \, \pi}{N}\right] = -\cos\left[\frac{|g^{i+(N-1)/2}|_N \, \pi}{N}\right] = \cos\left[\frac{(N - |g^{i+(N-1)/2}|_N) \, \pi}{N}\right]. \tag{A1}$$

The necessary and sufficient condition that (A1) holds is

$$|g^i|_N = N - |g^{i+(N-1)/2}|_N.$$

That is,

$$|g^i|_N + |g^{i+(N-1)/2}|_N = N \tag{A2}$$

where "g" is a primitive element. According to number theory [7], we have

$$|g^{(N-1)/2}|_N = N - 1; \qquad |A \times B|_N = \big|\,|A|_N \times |B|_N\,\big|_N$$

then

$$|g^{i+(N-1)/2}|_N = \big|\,|g^i|_N \times (N-1)\,\big|_N = \big|\,{-|g^i|_N}\,\big|_N = \big|\,N - |g^i|_N\,\big|_N.$$

As $0 < |g^i|_N \le N-1$ for $1 \le i \le N-1$, we have $\big|\,N - |g^i|_N\,\big|_N = N - |g^i|_N$. It means that

$$|g^{i+(N-1)/2}|_N = \big|\,N - |g^i|_N\,\big|_N = N - |g^i|_N$$

so

$$|g^i|_N + |g^{i+(N-1)/2}|_N = N.$$

Therefore, (11) is proved.

ACKNOWLEDGMENT

The authors are very grateful to the reviewers for their constructive comments.

REFERENCES

[1] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, pp. 90-93, Jan. 1974.

[2] U. Totzek and F. Matthiesen, "Two-dimensional discrete cosine transform with linear systolic arrays," in Proc. Int. Conf. Systolic Arrays, Ireland, 1989, pp. 388-397.

[3] N. I. Cho and S. U. Lee, "DCT algorithms for VLSI parallel implementations," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, no. 1, pp. 121-127, Jan. 1990.

[4] L. W. Chang and M. C. Wu, "A unified systolic array for discrete cosine and sine transforms," IEEE Trans. Signal Processing, vol. 39, no. 1, pp. 192-194, Jan. 1991.

[5] C. Chakrabarti and J. Ja'Ja', "Systolic architectures for the computation of the discrete Hartley and the discrete cosine transforms based on prime factor decomposition," IEEE Trans. Comput., vol. 39, no. 11, pp. 1359-1368, Nov. 1990.

[6] M. H. Lee, "On computing 2-D systolic algorithm for discrete cosine transform," IEEE Trans. Circuits Syst., vol. 37, no. 10, pp. 1321-1323, Oct. 1990.

[7] Shu Lin and Daniel J. Costello, Jr., Error Control Coding: Fundamentals and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1983, Chapter 2, Section 2.2, pp. 19-24.

[8] A. L. Fisher and H. T. Kung, "Special-purpose VLSI architectures: general discussions and a case study," in VLSI and Modern Signal Processing, S. Y. Kung et al., Eds. Englewood Cliffs, NJ: Prentice-Hall, 1985, Chapter 8, pp. 154-169.

[9] C. M. Liu and C. W. Jen, "A new systolic array algorithm for discrete Fourier transform," IEEE Trans. Comput., 1990, submitted for publication; also in Proc. Int. Symp. on Circuits and Systems, Singapore, 1991.

[10] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," Proc. IEEE, vol. 56, 1968, pp. 1107-1108.

[11] H. C. Reeve, III, and J. R. Lim, "Reduction of blocking effect in image coding," in Proc. ICASSP 83, Boston, MA, 1983, pp. 1212-1215.

[12] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988, Chapters 3 and 4, pp. 110-282.

[13] C. W. Jen and H. Y. Hsu, "The design of a systolic array with tags input," in Proc. ISCAS, Finland, 1988, pp. 2263-2266.
