A Separable Cross-Entropy Approach to Power Spectral Estimation

(1)

IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SlGNAL PROCESSING. VOL. 38. NO. I . JANUARY 1990 I os

A Separable Cross-Entropy Approach to Power

Spectral Estimation

Abstract-We present a new approach for power spectrum estima- tion based on a separable cross-entropy modeling procedure. We start with a model of a multichannel, multidimensional stationary Gaussian random process which is sampled on a nonuniform grid. An approxi- mate separable model is then fit to this, in which selected frequency samples of the process are modeled as independent random variables. Two cross-entropy-like criteria are used to select optimal separable ap- proximations. One of our methods yields a spectral estimation algo- rithm which is a generalized version of Capon’s MLM method, and the other is similar to classical windowing methods. We conclude with a discussion of different strategies for designing bandpass filters for use with this method.

I. INTRODUCTION

HERE are a large number of applications in which it

T

is necessary to estimate the power spectrum of a stationary random process given samples of the covariance kemel. The simplest approach is to form the periodogram by computing the magnitude squared of the Fourier Transform of a set of windowed data samples. Unfortu- nately, the periodogram yields inconsistent and biased spectral estimates. Bias is introduced by the data windowing process, and the variance of each frequency sample remains constant regardless of the number of data samples

[l]. Blackman-Tukey [2] suggested a simple method for improving the variance of the estimate by windowing the correlations of the data before Fourier Transforming. A similar effect was achieved by Welch [ 3 ] and others by averaging multiple periodograms formed from overlap- ping short frames of data. In both cases, the methods trade off resolution for decreased variance by using a window to average the power across a number of frequency components. The shape and width of the window controls the tradeoff between resolution and variance.

More modem methods use precise objective functions or models of the underlying random process to achieve higher resolution, smoother estimates, and lower variance. Burg’s Maximum Entropy method (MEM) (41

Manuscript received May 6 , 1985; revised April 4 , 1988. The work of

C.-Y. Liou was supported in part by the Technology Assessment Research Program of the Minerals Management Service. The work of B . R. Musicus was supported in part by the Advanced Research Projects Agency moni- tored by ONR under Contract N00014-81-K-0742 NR-049-506 and in part by the National Science Foundation under Grant ECS-8407285.

C.-Y. Liou was with the Research Laboratory of Electronics. Massa- chusetts Institute of Technology. Cambridge, MA. He is now with the Na- tional Taiwan University, Taiwan.

B. R. Musicus is with the Research Laboratory of Electronics. Massa- chusetts Institute of Technology. Cambridge. MA 02 139.

IEEE Log Number 893 169 I .

chooses a power spectrum estimate to maximize the entropy of the spectrum subject to the constraint that the spectrum must match the first few correlations of the data. Given one-dimensional, uniformly sampled correlations, the resulting spectrum is all-pole, and can be calculated using various linear prediction methods. An alternative approach is Capon’s Maximum Likelihood Method (MLM) which designs an optimal Finite Impulse Re- sponse (FIR) bandpass filter at each frequency to pass that frequency unmodified but reject as much other signal en- ergy as possible. The output power of the filter is used as the estimate of the power spectral density at that frequency. This spectral estimate can be shown to have the form of an all-pole model, although the coefficients are different than those of MEM. Numerous other interpre- tations of MLM exist in the literature. Marzetta [SI, for example, shows the MLM gives the upper bound on the amplitude of each frequency component given the known correlations. MLM is easy to compute even for nonuniformly sampled or multidimensional data, while MEM is quite difficult for these cases. The resolution of MLM, however, tends to be lower than that of MEM [6].

In this paper, we consider a very different approach to the problem of power spectrum estimation. We start by noting that a stationary Gaussian random process can be loosely characterized as having a Fourier Transform which is a Gaussian white noise process. As the amount of data grows asymptotically large, samples of the transform approach independent zero-mean Gaussian random variables with variance equal to the power spectrum. Our new approach to spectral estimation starts by fitting an ap- proximate model to the given data, in which we assume that a selected set of up to N frequency samples are in- dependent random variables. The best such separable model is chosen by minimizing one of two discriminant or cross-entropy functions. The approach used is pre- cisely that developed by Musicus [7], but applied to the spectral estimation problem. One of our methods yields a generalized form of the MLM method, and the other is similar to a classical windowing approach. Examples are shown demonstrating the performance of these techniques.

Our presentation is somewhat indirect. We will start by discussing the use of cross-entropy to build model approximations in which variables are forced to be independent. After discussing various special cases of the results, we will consider the power spectrum estimation problem. 0096-3S18/90/0100-0105$01 .OO

0

1990 IEEE

(2)

106 IEEE TRANSACTIONS ON ACOUSTICS. SPEECH. A N D SIGNAL PROCESSING. VOL. 38. NO. I, JANUARY 1990

11. GAUSSIAN ESTIMATION PROBLEM 111. SEPARABLE CROSS-ENTROPY MODELING

Suppose we are given a finite set of L jointly Gaussian real or complex valued random variables, x I ,

.

* * 7 X L ,

each a vector of length N j

where E [ ] is the expectation operator, Cov

[ I

is the covariance, and x H = x P is the complex conjugate trans- pose. is an NI X NI matrix. It will be convenient to define variables

where x and m , have length N = C N I , and R, is an N x N conjugate symmetric positive-definite covariance matrix. Then the probability density of x is Gaussian with mean

m ,

and covariance R ,

( 2 . 3 )

L e t y , , * *

. , y p be a set of P vectors, each of length

M I , which are formed from linear combinations of the xJ

p ( x l , * * , x L )

=

p ( x ) = N ( m , , R , ) .

L

y , =

c

T,,/xI

J = I

The y , are jointly Gaussian. Let

for i = 1,

.

*

-

, P . ( 2 . 4 )

where y has length M = C M i , and T has dimensions

M x N . Assume that M I N , and that T has rank M .

Then y is Gaussian with mean m,. and positive definite covariance R,

P(Yl> *

-

.

7 Y P )

=

P ( Y ) = N ( m , . , RJ

In general, the components { y i } will be correlated Gaussian random variables. Suppose, however, that the components { y i } are “close” to being independent in the sense that the off-diagonal blocks of the R , covariance matrix are “small” compared to the diagonal blocks. It might be computationally and analytically convenient in this case to find an approximation to the Gaussian model for y in which the components { y i } are treated as independent random variables, but in which the shape of the probability density is otherwise about the same. More for- mally, we will try to find a separable probability density model q ( y ) in which the components { y i } are indepen- dent random variables

4 ( Y ) = q l ( Y l ) 4 2 ( Y 2 ) * * * d Y P ) . ( 3 . 1 ) Our goal is to find a separable q ( y ) which optimally ap- proximates the original p ( y ) in some manner.

Kullback’s two discriminant functions 181 are good cri-

teria for measuring the goodness-of-fit between p and 4. The first discriminant function is equivalent to the cross- entropy criterion suggested by Shore and Johnson 191 and developed for this class of problems by Musicus [7]. The second function is just the reverse of the first. We will develop these two approaches in the next two sections, and then compare the results.

A . Cross-Entropy (CE) Method

Shore and Johnson considered the following problem. Suppose that we are given an U priori estimate p ( ) of a probability density for some unknown

a ,

but that we are not very certain of its validity. Suppose that new observation data arrive which suggest that this estimated density is incorrect, and that the actual density must belong to some set Q. Shore and Johnson started with four in- variance and consistency axioms which, they argued, any reasonable estimation procedure must obey. These insist, for example, that the same answer must result regardless of the coordinate system chosen for the unknown. Shore and Johnson were then able to prove that any estimation procedure which satisfies these axioms must yield the same estimate as that found by minimizing the cross-entropy

H(

q , p ) over the set q E Q

H ( q , p ) = 4 ( a ) log

4O

d a .

P ( a >

where

We will call this the “Cross-Entropy’’ Method (CE), and will denote the solution by

qc-.

For our problem, Q is the set of separable densities, and so

m, = Tm,

R, = TR,TH. (2.6) n

It will be convenient to partition R , into blocks [ R , ] , , , of 4 C E + min

1

4 l ( Y l ) * * * 4P(YP)

91 . Y P

size M I x M J , corresponding to the covariance of y , and

(3)

LlOU A N D MUSICUS. SEPARABLE CROSS-ENTROPY APPROACH 107

We show in Appendix A that this minimization problem has a unique solution which has the form

Only the second term in (3.8) depends on q. This term can be further decomposed, reducing the minimization

P

f o r i = 1 ,

- .

, P . (3.11)

With this objective function, therefore, the separable function simply models each component as having the

i =

rI

I ~ [ R ; ~ l ; , ; ~

( 3 . 6 ) IRJ'I

H ( 4 C E 9 P ) = E log

where

1

R,'

1

is the determinant of R ; ' . Note that H ( q , p )

is always'positive, and attains its absolute minimum value of 0 if and only if q = p . Thus, the minimum cross-en-

tropy value will be zero if and only if the original probability density is already separable. If the minimum H ( i j C E , p ) is close to 0, then qCE is a good separable ap- proximation to p . If the minimum H ( q C E , p ) is large,

however, then the y i are strongly correlated and the best

separable approximation is still quite poor.

B. Reversed Cross-Entropy Approach

A related approach which gives quite a different result

is to reverse the order of the arguments in the cross-entropy function, choosing

4

to minimize H ( p , q )

same mean and covariance as the original marginal density.

The value of the reverse cross-entropy function (3.7) at

this solution is

P

Once again, this will be zero if and only if p is already

separable. If the value is close to 0, then

GRCE

is a good separable approximation to p . Otherwise, the y i are not

modeled well as independent random variables.

C . Comparison of the Techniques

It is interesting to compare the behavior of the separable approximations for these two methods. If the original density p were separable,

ORCE min

s

P ( Y l ' *

.

7 Y P )

91,' " . 9 ~

We will call this the ''Reverse Cross-Entropy " (RCE)

Method, and will denote the solution as

GRCE.

This minimization is considerably easier to solve than that in (3.3).

We can write (3.7) in the form

P

P ( Y > = P ( Y l ) * * P ( Y P > , (3.13) then both methods would simply set

qCE

= QRCE = p , and

the value of the cross-entropies would both be zero,

H (

dCE,

p ) = H ( p ,

gRcE)

= 0. In general, however, when

p is not separable, then the two methods will give some-

what different answers, and the cross-entropy values will be somewhat different. The CE method (3.5) sets each

component of the density q, to the conditional probability

density of y , given that the remaining components are set

equal to their a priori means. The RCE method (3.11)

sets each component density to the marginal density of y ,

.

In general, with correlated y , , the CE density estimates

will be "narrower" than the RCE estimates. In fact, it is straightforward to show through matrix manipulation that the CE covariance estimate is always smaller than the RCE covariance estimate

( 3 . 1 4 )

Both methods have a similar structure; the primary difference is that CE bases its estimates on diagonal blocks

(4)

108 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH. A N D SIGNAL PROCESSING. VOL. 38. NO. I . JANUARY 1990 of the inverse covariance matrix, [R,-l]l-ll, while RCE

uses diagonal blocks of the covariance matrix, [ R , ] , , , .

This suggests a simple duality between these methods: the $(-E and qRCE approximations to p ( y ) = N ( m, , R, ) are equal respectively to the

4RCE

and

gCE

approximations to

P ( Y )

= N ( m , ,

K ' ) .

Also, H ( ~ c E .

P I

= H ( P , & E ) and This duality also implies that neither method can be better than the other in all cases. For two unknowns, ex- ample 1 below shows that both methods always achieve equal cross-entropy, H ( Q C E , p ) = H ( p , QRCE). For 3 or more unknowns, however, examples can be constructed in which the difference between H ( Q C E , p ) and H ( p , QRCE) is arbitrarily large (see Appendix B).

D.

Example 1-Two Variables

Let y I , y 2 be two jointly Gaussian real-valued random variables with means E l y , ] = m,, and covariances Cov ( y , , y , ) = R,]

. We will try approximating this model

with a separable model q 1 ( y 1 ) q 2 ( y 2 ) . The CE method gives

H ( p , 4 R C E ) = H ( 4 R C E ?

P ) .

R , = TR.,TH (3.21)

R,:' = T - ~ R . ; ' T - I . ( 3 . 2 2 ) Let [ T I i , , be the ith block of rows of T

(3.23)

so that [ T I ; , , is an N ; X N matrix, and

Yi = [ T I ; , * x . (3.24)

Similarly, let [ T - ' ] , , ; be the ith block of columns of T-I T - ' = ([ T - I _{* * *} _[T-'I*,,] (3.25) so that P x = T - I y = [ T - ' ] , ,y j . (3.26) i = I

t

TI,,*

t

T - l l * , j = 6,.J (3.27)

where 6, , is the Kroneker delta function, 6,

,

= 1 if i =

-._I

4 ( y 2 ) = N(m,?, R22 - R 2 1 R ; 1 R 1 2 ) (3.16) j , = O else. Equation (3.27) implies that [ T I ; , , is orthog- onal to all [ T - ' ] * , , for i # j , and has a unit projection onto [ T - ' ] , , ;. Then our two approximation methods give and the value of the cross-entropy function is

Note that the Cross-Entropy solution for the ith variance depends on [ T - ' ] * , ; , which in turn depends on the choice of all the linear transformations [ T I I , * ,

. . .

, [ T I p . * . The Reversed Cross-Entropy solution for the ith variance, on the other hand, depends only on [ T ] ;, *.

where

4 l ( Y l ) = N ( m , , , R i l )

4 2 ( Y 2 ) = N ( m w R22) (3.19)

and its cross-entropy is exactly equal to that of CE

(3.20) IV. POWER SPECTRUM ESTIMATION

H ( 4 C E , P ) = H ( p , Q R C E ) .

Note that the covariance of the CE estimates is smaller than the covariance of the RCE estimates. Also, note that if the unknowns are initially almost uncorrelated, then R I 2

= R{ = 0 . In this case, H ( i j C E , p ) = H ( p , QRCE) = 0 , and both separable models will be good fits to the original density. Notice, in fact, that both methods also give vir- tually the same answer, since R I , - R 1 2 R G 1 R 2 1 = R I ' .

On the other hand, if the variables are highly correlated, then R I I - R 1 2 RG1 R z l = 0 , the cross-entropy values H are large, and both separable approximations are equally poor. Note that in this case, CE assigns nearly zero variance to the variables, while RCE assigns variance R;;. E. Example 2-Invertible Transformations

Suppose M = N and T is a square and invertible trans-

We can easily apply these ideas to power spectrum estimation. Let x ( t ) be a complex-valued, multidimensional, stationary vector Gaussian random process, with zero mean and with stationary covariance kernel R ( t ) .

Assume that each sample x ( t ) is a vector of length p , and that the process is defined for all t in a hypercube, - U 5

t I U .

E [ x ( t ) ] = 0 for - U I t I U

E [ x ( t I ) x H ( t 2 ) ] = R ( t , - t 2 ) for - U I t , , t2 I U . ( 4 . 1 )

We will assume that R ( t ) is known for all t . (The derivation for real-valued x ( t ) would be slightly different, but the results would look the same.) Let us define X ( C O ) as

(5)

LIOU A N D MUSICUS: SEPARABLE CROSS-ENTROPY APPROACH 109

over the known domain

x(o>

=

j"

- U x ( t ) e - j w " d t . _{( 4 . 2 )}

Also, let P ( o ) be the Fourier Transform of R ( t )

r m

P ( o ) =

1

-m R(t)e-@""'dt. _{( 4 . 3 )}

Now suppose we measure a selected set of samples of this random process, x,, = x ( t n ) , where the samples t,

may be uniformly or nonuniformly distributed in the domain [ - U , U ] . The vector of samples x will be a zero

mean Gaussian random variable with covariance matrix

R, having elements [ R r l k , / = R ( t k - t , ) . The covariance matrix R., is positive definite and symmetric. If the sam-

ples tn are uniformly spaced on a rectilinear grid and or- ganized in raster scan fashion, then R, will also be block

Toeplitz.

Let us consider forming estimates of P selected fre-

quency samples ok of the Fourier Transform

X(

CO,) using

the given L data samples x,,. W e will form each estimate yk of X ( o k ) by taking an appropriate linear combination of the data samples

( 4 . 4 )

Let T k ( U ) be the transform of the kth set of coefficients

( 4 . 5 )

in which the frequency component estimates y, were modeled as exactly independent random variables. Hopefully, the variance of y, in this separable approximation would then be a good estimate of P ( o k ) . The separability of the model, moreover, will often simplify further analysis of the expected characteristics of the process.

To find a good separable model, we can apply the C E and RCE methods discussed in the previous sections. Col- lect the samples yk into a vector y, and the elements T,,n

into a matrix T . Assume that P = L and that T is invert-

ible. In the following subsections, we will compute the CE and RCE separable approximations, and show that the corresponding variances of y, can be treated as estimates of the power spectrum.

A . Reverse Cross-Entropy Spectral Estimation

The RCE estimate is somewhat easier to interpret, so we will treat it first. Each y, is constructed by filtering the data x,, through a filter Tk,,, as in (4.4). The RCE separable

approximation models the density

dk

( y k ) as the marginal density o f y k . If we define

pRcE(

a,) as the variance ofy,, then (3.28) shows that

RCE: Bk(yk) = N ( o , p R C € ( U L ) )

where

Applying Parseval's theorem to (4.4), we can show that

yk =

p

j

T , ( o ) X ( o ) d o . ( 4 . 6 )

To show that this is a reasonable power spectrum estimate, note that (4.7) implies that

I m

(27r) - m

.

Thus, yk will be a good approximation to

X(o,)

if we choose coefficients Tk,,, such that T k ( o ) is a good approximation to an impulse centered at w,. Furthermore, because the samples yk are linear combinations of Gauss- ian random variables, x,,, they will also be Gaussian ran-

dom variables with mean and covariance

I

pRCE(o,)

=

1

T , ( ~ ) P ( ~ ) T F ( ~ )

d o .

( 4 . 9 ) (27r) --OD

Thus, the Reverse Cross-Entropy method is equivalent to a classical windowing approach for estimating the power spectrum. If the Tk ( o ) are good bandpass filters centered at w k , then the estimate formed from the linear combination of correlations in (4.8) equals an average of the

power spectrum values around o k . For proper scaling in the RCE method, it seems reasonable to scale the filters

Tk( o ) such that if the actual power spectrum is flat, P ( w ) = Po, then

PRcE(

o ) = Po also. Applying this require-

ment to (4.9) gives the normalization constraints

1

= -

1

T , ( o ) P ( o ) T F ( o ) d o . ( 4 . 7 ) I m

(27ry --OD ~

j

T , ( o ) P o T ~ ( w > d o

( 2 4 -m

Thus, if the filters Tk ( o ) are designed with minimal overlap, then the correlation between different samples Y k and yl will be minimal. In other words, the samples y, will be

= Tk,,,PoT:,, = Po for all Po. ( 4 . 1 0 ) n

nearly independent zero-mean Gaussian random variables with variance related to the power spectrum sample

A particularly good choice for the

windowed

Tk,n would be

1

It would be convenient for many estimation problems T - -jw:t,,

P ( O k ) .

(6)

I IO IEEE TRANSACTIONS ON ACOUSTICS. SPEECH. A N D SIGNAL PROCESSING. VOL. 38. NO. I , JANUARY IYYO where the normalization factor 1

/&

is chosen to satisfy

(4.10). With this choice

and so the RCE power spectral density estimate is identical to a classical Bartlett (triangular) windowed periodogram estimate.

The difficulty with the RCE power spectrum is that, in general, the filters T k ( w ) overlap, so that R,, = TR,TH is far from diagonal. Leakage between the different filters implies that even if all the power in t,he actual power spectrum is concentrated near wk, our PRCE ( w ) will be nonzero over a wide range of frequencies. This is the familiar problem with sidelobe leakage in classical periodogram estimation.

B. Cross-Entropy Spectral Estimation

In the CE method, the component density

4,

( y , ) is estimated as equal to the conditional density of y L given that all the remaining components :re exactly equal to their a posreriori mean. If we define PcE( w k ) as the variance of y k in this separable approximation, then (3.28) implies that

CE: 4 L ( Y l ) = N(O9 F C E b L ) )

where

(4.12) The form of this estimate is highly reminiscent of a windowed form of Capon's MLM estimation method, in

which columns of the inverse filter matrix T-I replace the directional vectors in Capon's method. A simple interpretation of this effect is difficult, however, since the kth col- umn of T - ' will depend, in a nonlinear fashion, on the choice of all the filters T I ( w ), and the elements of

R,-' depend nonlinearly on all the given correlations The key to understanding this formula is to recognize that this is the conditconal variance of y k given the remaining components. p , , ( ~ ~ ) is thus the expected power in y , which cannot be predicted from knowledge of the other frequency estimates. This estimator therefore tends to minimize leakage from other frequency bands. A side effect of this leakage cancellation, however, is that leakage from the kth filter into adjacent filters will be used to partially predict the power in Y k , thus artificially reducing the power estimate PcE( w k ) below its proper value. This argument supports our observation in (3.14), which in this context implies that

R ( t n - tm).

It would be appropriate to choose different filter scaling for the CE method to compensate for this effect. In par- ticular, i f P ( w ) is flat, P ( w ) = Po, then R ( t ) = P 0 6 ( t )

and R, is block diagonal. If we require that

p,(

w ) = Po

in this case, then formula (4.12) suggests that we scale T

such that

[ Tp1Ifl:,Pi1 [ = P t l for all Po. (4.14) A particularly useful set of filters Tk, for the CE method

is found by designing them so that they do not overlap at their center frequencies

n

T k ( 0 , ) = &6k,/I for all k, 1 (4.15) where

&

is a scaling factor chosen to satisfy (4.14). The advantage of this idea is that if the signal is actually a sum of complex exponentials at frequencies w k with gains

xk

(4.16)

h

then it is easy to show that

yk = (4.17)

Thus, each filter output would only reflect the amplitude of the frequency component at w k , and would be insen- sitive to frequency components at the other frequencies

0 , . Another nice property of this method is that if we write (4.15) in matrix form, then we find that

TW = &I (4.18)

where W is a block DFT matrix, [ W ] , , , = eJo'fAI. Substi- tuting the resulting formula for T - ' into (4.12)

p C E ( w L ) = L

C

[ R ~ ' ] n , , l ~ e p J W A ( f " p * i ' ~ ) ] - I . (4.19) For this choice of filters, the CE power spectrum is exactly equal to Capon's MLM estimator. As is well known, this MLM method exhibits less sidelobe leakage than the periodogram (RCE) estimates.

[

n , m

V . FILTER DESIGN-UNIFORM SAMPLING When the samples tn and w k are selected on uniform rectilinear grids, with the number of time samples equal to the number of frequency samples, L = P , and the power spectrum is band-limited with the Nyquist frequency, then it is easy to design good bandpass filters T k ( w ) . In par- ticular, it is easy to show that the rectangular windowed filters suggested in (4.11) for the RCE method are identical to the "nonoverlapping" filters suggested in (4.15) for the CE method. In fact, this T matrix is orthonormal, with T-I = T H = W / & . The formula for the CE method becomes identical to the triangular Bartlett periodogram estimate, and the RCE method is identical to Capon's MLM method.

An interesting interpretation of these separable approximations results if we transform the separable densities

(7)

LIOU A N D MUSICUS: SEPARABLE CROSS-ENTROPY APPROACH 1 1 1 = N ( 0, R , ) has a block Toeplitz covariance matrix R , .

Using the fact that T - ’ = T H , the separable approximations can be shown to correspond to

& ( x ) = ~ ( 0 , T ~ A ~ ~ T ) (5.1)

4 R C E ( X ) = N ( O , THARCET) ( 5 . 2 )

and

where ACE and ARCE are block diagonal matrices with diagonal block elements PcE( ak) and

PRCE(

ak), respectively. The formulas for the covariance matrices in (5.1)

are thus in Jacobian normal form. The columns of T H ,

which are complex exponentials, are the eigenvectors of the matrix, and the diagonal elements of the A , which are uniformly spaced power spectrum sample estimates, are the eigenvalues. The resulting covariance matrices are thus different block circulant approximations to the original Toeplitz covariance matrix R,, in which each block

row is equal to the previous block row rotated right one entry. This is particularly fortuitous, since it is well known that as the size L of a noncirculant Toeplitz matrix R, goes to infinity, the eigenvalues of R, will approach uniformly

spaced samples of the power spectrum, and the eigenvectors will approach uniformly spaced complex exponentials [ l o ] , [ 1 1 1 . In the limit as L .--* 0 0 , we would therefore

expect both separable approximations to be identical to the original model.

VI. FILTER DESIGN-NONUNIFORM SAMPLING

Achieving good spectral estimates with nonuniformly spaced samples til, unfortunately, is much more difficult

[12], [13]. The fundamental difficulty is that when the t,,

are distributed in a grossly nonuniform manner, then it is difficult, if not impossible, to design good bandpass filters with nonzero coefficients Tk,l, located at t,!. Furthermore, since the Nyquist theorem does not apply, it is hard to select a natural band-limited assumption for the problem, and therefore it is not even obvious how to choose the frequencies w k . Another problem is that we will not be able to apply our elegant interpretation of these methods as circulant matrix approximations to Toeplitz covariance matrices, since the nonuniformly sampled covariance matrix R, will not be Toeplitz.

We have tried a large number of filter design strategies for various nonuniform grids tn, but have not had much success in achieving good results for both the CE and RCE methods simultaneously. A key difficulty is that in the nonuniform sampling case, the filters in (4.11) are quite

different from those in (4.15). Experimental evidence

suggests that the rectangular filters suggested in (4.11) for the RCE method work acceptably for RCE, but they lead to an ill-conditioned T - ’ matrix and give poor results for CE. Similarly, the nonoverlapped filters suggested in

(4.15) for C E work well for CE, but generally lead to an

ill-conditioned T matrix and give poor results for RCE. Other design strategies we have tried include using warped sinc functions to interpolate reasonable bandpass filters T k ( a), and iterative design techniques which iter-

ate between the time and frequency domains to find a low- pass filter shape with nonzero coefficients in the appropriate places. It is possible to find filters which work well in either the CE or the RCE methods; for example, the nonoverlapping filter designs in (4.15) work well for CE

and give the same result as Capon’s MLM method. Un-

fortunately, we could not find filters which work well in both methods. This is clearly a topic for further research.

VII. CONCLUSIONS

In this paper, we have applied two different cross-entropy criteria to the power spectrum estimation problem. In each case, given L samples of data, we build L linearly

independent filters to estimate the signal transform at L

different frequency samples ak. Provided these filters are designed to be good bandpass filters with minimal overlap, their outputs will be good estimates of the signal transform at a k , and their outputs will be nearly independent Gaussian random variables with covariance related to the power spectrum. In our cross-entropy methods, we force two different “optimal” separable approximations to the actual model, in which these filter outputs are ap- proximated as being exactly independent. One of these methods yields an estimate of the variance of each filter output which is similar to a windowed periodogram estimate. The other method yields an estimate of the variance which is similar to Capon’s MLM spectral estimate. In

the case of uniformly sampled data, uniform frequency samples, and rectangular filters, we get exactly a triangular window Bartlett spectral estimate and Capon’s MLM

estimate. The solutions in this case can be viewed as building circulant approximations to the original Toeplitz covariance matrix R,r.

The key to getting good spectral estimates from the CE and RCE methods is to choose the transform matrix T

carefully. In designing T , we need to keep several goals in mind. The samples y k must be accurate and robust es-

timates of

X(

ak), and the variance of each y a must be an accurate estimate of the power spectrum P ( a , ) . At the same time, the y k must be chosen to be as independent of

each other as possible, so that R,. is almost block diagonal,

and so that our separable appro.ximation is valid. One difficulty with both methods is that, given L sam-

ples of x,,, they estimate at most L samples from the power

spectrum, y,. This appears to be a fundamental statement about the achievable spectral resolution: given L data samples, only L independent frequency samples can be

estimated. In many applications, however, it would be useful to interpolate between these known samples to get a continuous power spectrum estimate at all frequencies. A related difficulty is involved in choosing the set of L

frequencies of interest, ak. If we change only one of these frequencies, and then redesign T for the new set, we may get somewhat different estimates of the power at all the frequencies.

For the special case of uniform spatial and frequency samples with rectangular windows, these issues are easily resolved. Fortunately, in this case, the spectral estimation

(8)

112 I E E E T R A N S A C T I O N S ON ACOUSTICS. SPEECH. A N D S I G N A L PROCESSING. VOL. 38. NO. I . J A N U A R Y i w n

formulas at frequency or; depend only on cor;, and do not explicitly refer to the remaining L - 1 samples. Simply

by replacing ok with a general frequency parameter w , we can convert these formulas into a continuous power spectrum estimate.

For nonuniformly sampled data, unfortunately, it is difficult to achieve all the design goals for these methods. Designing good bandpass filters on a nonuniform grid, for example, is often difficult or impossible. The aliasing that occurs cannot be simply expressed as a periodic summa- tion at the Nyquist frequency, but is a more complex dis- tortion of the spectrum. Picking a good set of frequencies to force to be independent is difficult. Theorems are needed to show that a separable approximation is asymptotically correct in the limit as the size of the nonuniformly sampled, non-Toeplitz covariance matrix R, goes to infinity. Perhaps a better method might force selected samples of the Laplace transform to be independent random variables, where the samples are not necessarily located on the imaginary axis. Although we have tried several strategies for designing nonuniform filters for grossly nonuniformly sampled data, we have been unable to get good spectral estimates from both methods with the same filter matrix T. The key to good performance, undoubt- edly, will be adaptively design the T filter matrix using the given covariance matrix R, in a manner similar to that used by Capon in deriving MLM. Much work remains to be done on modifying this idea for the cross-entropy spectral estimation techniques.

APPENDIX A

DERIVATION OF CE METHOD

To solve (3.3), let us focus on the ith density q, ( y, ).

We can rewrite (3.3) as

+

other terms ( A . 1 )

where the other terms do not depend on q l , and where:

1% +I(YJ =

s

l o g P ( Y , ( { Y , , j f

'1)

/ = I

fr

4/(Y,) dy,.

J"

( A . 2 )

[i

i f y is real

E =

( I i f y is complex

and where E [

.

I

q / ] is the expectation with respect to density q J . The minimum of H ( q, p ) over qi must occur at the value which minimizes (A. I) subject to the constraint

that q, ( yi) dyi = 1 . Equation (A. l ) , however, is strictly convex in qi, and thus it is easy to show that the unique minimum is

4 i ( Y l > = Ki+i(Yi) ( A . 5 )

where Ki is a normalization constant. Formula (A.3) then implies that the

gi

( y i ) are Gaussian

All that remains is to solve for the

FI

values. Note that p, = E [ Y,

I

Q,]. Substituting this into (A.4), multiplying both sides by [ R;'],., , and rearranging, gives

-

P

c

[R,'l1,

(I,

- m,,) = 0 f o r i = 1;

-

, P. J = I

( A . 7 ) Combining terms into matrices and vectors in the obvious way,

R\-'(ji - m,) 0. ('4.8)

Since RIp1 is assumed to be full rank,

The U posterior means

Fi

must therefore exactly equal to

the a priori means m,,,, and thus the probability density estimate qcE which minimizes the Shore-Johnson cross- entropy is as given in (3.4).

APPENDIX B

CROSSENTROPY FOR THREE UNKNOWNS

Consider a problem with three real-valued unknowns with covariance matrix

0 PI3

Substituting the formula for the conditional density [de-

rived from (2.6)] into this and evaluating the expectations R, = Cov

(:/)

=

(:

l p ; ) . (B.1)

PI3 P23 over y , , * *

.

1 Y, - 1 7 Y, + I . . . . ,Y/J gives

Then log +,(Y,> = -'E(Y, - Ff)HIR;ll,.,(Yf -

P I )

+

constant

(9)

LlOU AND MUSICUS. SEPARABLE CROSS-ENTROPY APPROACH I I3

Equations (3.6) and (3.12) show that

Subtracting gives

P13P23 (B.4)

1 - d 3 - P I 3

Thus, the difference between the cross-entropy measures of the two separable approximations can be arbitrarily large in a positive direction. If we had started instead with a covariance matrix Ry with the formula in (B.2), then the difference would have been arbitrarily large in a negative direction. Thus, for three or more unknowns, the value of the minimum cross-entropies for the CE and RCE methods can be arbitrarily different.

ACKNOWLEDGMENT

We would like to thank G. Duckworth and J. K. Van- diver for numerous discussions on this topic, and D. Cobra for programming assistance with filter design.

REFERENCES

[ I ] G . M. Ienkins and D. G . Watts, Spectral Analysis and I t s Applicu-

tions.

121 R . B. Blackman and J . W. Tukey. The Measurement o f P o w e r Spec-

t r a . New York: Dover, 1958.

131 P. D. Welch, “The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, mod- ified periodigrarns,” IEEE Trans. Audio Electroacousr., vol. AU-15. pp. 70-73; reprinted in Modern Spectrum Analj~sis. D. G. Childers. Ed.

[4] J . P. Burg, “Maximum entropy spectral analysis,” Ph.D. dissertation, Stanford Univ., May 1975.

[SI T. L. Marzetta, “A new interpretation for Capon’s maximum likeli- hood method of frequency-wavenumber spectral estimation.” IEEE

Trans. Acoust., Speech, Signal Processing, vol. ASSP-3 1, pp. 445- 449, Apr. 1983.

161 R. T. Lacoss, “Data adaptive spectral analysis methods,” Geophys-

ics, vol. 36, pp. 661-675; reprinted in Modern Spectrum Anu/?sis, D. G. Childers, Ed.

San Francisco, CA: Holden-Day, 1968.

New York: IEEE Press, June 1967.

New York: IEEE Press, 1978.

E. R . Musicus, “Iterative algorithms for optimal signal reconstruc- tion and parameter identification given noisy and incomplete data,’’ Ph.D. dissertation, M.I.T., Dep. EECS, Aug. 1982.

S . Kullback, Information Theor? and Statistics. New York: Wiley, 1959.

1. E. Shore and R . W. Johnson. “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-en- tropy,” IEEE Trans. Inform. Theory, vol. IT-26, pp. 26-37, Jan.

1980.

R. Molten Gray, “On the asymptotic eigenvalue distribution of Toe- plitz matrices,” IEEE Trans. Inform. Theory. vol. IT-18. pp. 725- 730. Nov. 1972.

U. Grenander and G . Szego, Toeplirz Forms and Their Applicutions. Berkeley, CA: University of California Press, 1958.

R. H. Jones, “Estimation of spatial wavenumber spectra and falloff rate with unequally spaced observations,” J . Astr. S c i . , vol. 32, no. 23, pp. 260-268.

P. R. Julian and A. K . Cline, “The direct estimation of spatial wave- number spectra of atmospheric variables,” J . Atmos. Sci., vol. 31,

no. 6, pp. 1526-1539, Sept. 1974.

Cheng-Yuan Liou (M’86) was born in Taiwan o n November 14. 1951. He received the B.S. degree i n physics from National Central University in 1974, the M.S. degree in physical oceanography from National Taiwan University in 1978. and the Ph.D. degree in ocean engineering from the Mas- sachusetts Institute of Technology in 1985.

From 1986 to 1987 he was Visiting Associate Professor in the Institute of Applied Mechanics, National Taiwan University. where he taught courses in stochastic ~ r o c e s s e s and digital signal

_-

-processing and did research in sonar array signal -processing and signal diagnostic systems. In 1988 he joined the Faculty at the same university, where he is currently Associate Professor in the Department of Computer Science and Information Engineering. His current interests include array signal processing and neural networks.

Dr. Liou is a member of the International Neural Network Society.

Bruce K. Musicus (S’77-M’81) was born in Chi- cago, IL. in 1954. He received the S . B . degree from Harvard University. Cambridge. MA, in 1975. the M.S. and E.E. degrees froin M.I.T. i n 1979. and the Ph.D. degree from M.I.T. in 1982. all in electrical engineering and computer science. He was an Assistant Professor of Electrical En- gineering and Computer Science at the Massachu- setts Institute of Technology, from 1982 to 1986. and has been an Associate Professor there since then. He has sat on the Rockwell International Ca- reer Development chair at M.I.T. His research interests include algorithms and architectures for digital signal processing. stochastic estimation, and system identification; and he has published numerous papers in these fields.