2010 IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP 2010


March 14-19, 2010 • Dallas, Texas, U.S.A.


©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

IEEE Catalog Number: CFP10ICA-CDR ISBN: 978-1-4244-4296-6 ISSN: 1520-6149


A STUDY OF SEVERAL MODEL SELECTION CRITERIA FOR DETERMINING THE NUMBER OF SIGNALS

Shikui Tu and Lei Xu

Department of Computer Science and Engineering,

The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. {sktu,lxu}@cse.cuhk.edu.hk

ABSTRACT

Addressing the problem of detecting the number of source signals as selecting the hidden dimensionality of the Factor Analysis (FA) model, we investigate several model selection criteria via a new empirical analysis tool that examines the joint effect of the signal-to-noise ratio (SNR) and the sample size $N$ on model selection performance. The contours of the model selection accuracies visualize a three-region partition of the space of SNR and $N$, and a diminishing marginal effect in which SNR and $N$ trade off against each other at fixed performance. Moreover, the newly derived Variational Bayes algorithm and three variants of the Bayesian Ying-Yang (BYY) algorithm are more robust against reductions in SNR and $N$, with the BYY variant that updates the priors’ hyperparameters being the best in general.

Index Terms: Number of signals, hidden dimensionality, linear model, model selection, criteria

1. INTRODUCTION

Detecting the number of underlying source signals is essential in many signal processing problems, such as sensor array processing, retrieving the poles of a system response, direction-of-arrival estimation by a smart antenna system, and retrieving overlapping echoes from radar backscatter (see, e.g., [1]). The observed vector can be modeled as a superposition of a finite number of underlying source signals with additive noise. The source signals and the noise vector sequence are assumed to be two independent ergodic zero-mean Gaussian random processes. Moreover, this issue can also be addressed as a model selection problem, namely selecting the hidden dimensionality of Factor Analysis (FA) [2] in its special case of Principal Component Analysis (PCA) [3] under the Maximum Likelihood principle.

A classical approach to model selection is the two-stage procedure, i.e., parameter learning is performed on a set of candidate models, among which one is selected by a model selection criterion. The existing criteria include Akaike’s Information Criterion (AIC) [4], Bozdogan’s Consistent Akaike’s Information Criterion (CAIC) [5], the Hannan-Quinn information criterion (HQC) [6], Schwarz’s Bayesian Information Criterion (BIC) [7] (which coincides with Rissanen’s Minimum Description Length (MDL) [8]), and the more recently developed Minka’s criterion (MK) [9], Variational Bayes (VB) [10], and the Bayesian Ying-Yang (BYY) harmony learning criterion [11].

∗ Corresponding author: Lei Xu. Email: [email protected]. The work described in this paper was fully supported by a grant from the Research Grant Council of the Hong Kong SAR (Project No: CUHK4177/07E).

AIC and MDL were first introduced to determine the number of signals in [1], which was followed by much subsequent research such as [12], with emphasis on the asymptotic properties of the criteria under certain assumptions. Following this track, this paper aims at a systematic investigation of those criteria via a new empirical analysis tool that examines the joint effect of the signal-to-noise ratio (SNR) and the sample size $N$, rather than the effect of either SNR or $N$ with the other fixed as in previous work, e.g., [13].

The adopted parameterization of FA is different from the common one in, e.g., [10]. The two forms are equivalent in Maximum Likelihood learning, but differ in model selection, as pointed out in [14] under BYY. In fact, the adopted form results in a better model selection ability under BYY and VB, with details given in another working paper [15]. Here, we implement the EM algorithm [3] for the classical AIC etc. For VB, we derive a new VBEM algorithm by imposing appropriate priors on the unknown parameters. In addition to the existing prior-free version (BYYo) [16], BYY is further implemented by not only adopting the same priors as the VBEM (BYYp) but also updating the hyperparameters of the priors (BYYph) under the Hessian-based second-order information conservation principle [16].

By varying $N$ and SNR over wide ranges in the empirical analysis, we connect the points of equal model selection accuracy into contours, which define a family of model selection performance indifference curves (a term borrowed from economics) for each criterion. We are then able to reveal a diminishing marginal effect, i.e., the amount of SNR (or $N$) needed to trade for a unit of $N$ (or SNR) increases if the performance is kept unchanged, and to present a three-region partition of the space of $N$ and SNR, i.e., all methods perform well when SNR and $N$ are large and badly when they are too small, while within the region of moderate SNR and $N$ the performances of these methods clearly differ. Moreover, VB and the three variants of BYY outperform the others in the region of diversity, while BYYph is the best in general.

The rest of the paper is organized as follows. Section 2 formulates the problem of determining the number of signals as estimating the hidden dimensionality of FA. Section 3 introduces the two-stage procedure with several model selection criteria, whose behaviors are empirically analyzed in Section 4, followed by the concluding remarks in Section 5.

2. PROBLEM FORMULATION

In signal processing [1], a common model for the received complex-valued signal vector $x(t)$ from an array of $n$ sensors at time instance $t$ is $x(t) = As(t) + e(t)$, where $A$ is the steering matrix with full column rank. The $m$-dimensional source signal vector sequence $\{s(t)\}$ is assumed to be a stationary and ergodic Gaussian random process with zero mean and positive definite covariance matrix $\Sigma_s$. The noise sequence $\{e(t)\}$ is assumed to be a stationary and ergodic Gaussian vector process, independent of the source signals, with zero mean and isotropic covariance matrix $\sigma_e^2 I_n$, where $I_n$ is the $n \times n$ unit matrix. Determining the number of source signals based on an observed sequence $\{x(t)\}_{t=1}^{N}$ is to estimate the rank of $A\Sigma_s A^H$ in $\Sigma_{x|c} = A\Sigma_s A^H + \sigma_e^2 I_n$, where $\Sigma_{x|c}$ is the population covariance matrix of the received data, and the superscript “H” denotes the complex conjugate transpose.

On the other hand, a model called Factor Analysis (FA) in machine learning [16, 3] and statistics [2] assumes an observed real-valued $n$-dimensional random variable $x$ as follows:

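$$
\begin{cases}
x = Uy + \mu + e, \quad \Theta_m = \{U, \mu, \Lambda, \Sigma_e\}; \\
q(x|y) = G(x|Uy + \mu, \Sigma_e), \quad q(y) = G(y|0, \Lambda), \\
q(x|\Theta_m) = \int q(x|y)\,q(y)\,dy = G(x|\mu, \Sigma_x), \\
\Lambda = \mathrm{diag}[\lambda_1, \ldots, \lambda_m], \quad \Sigma_x = U\Lambda U^T + \Sigma_e,
\end{cases}
\qquad (1)
$$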

where $U^T U = I_m$, $\Sigma_x$ is the population covariance matrix of the data, and $\Sigma_e = \sigma_e^2 I_n$, which makes FA equivalent to PCA under the maximum likelihood principle [3]. Estimating the hidden dimensionality $m$ is to determine the rank of $U\Lambda U^T$ in $\Sigma_x = U\Lambda U^T + \sigma_e^2 I_n$ based on $\{x_t\}_{t=1}^{N}$.

The two rank estimation problems are equivalent in the sense that both aim to estimate the same rank $m$ in two covariance equations of the same form. Next, we focus on the latter one, which is also widely used as a dimensionality reduction technique for feature extraction.
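To make the FA formulation concrete, the following minimal sketch draws samples from eq.(1) and prints the eigenvalues of the sample covariance. It is an illustration only: the function name `sample_fa`, the zero mean ($\mu = 0$), the equal factor variances and the chosen sizes are assumptions made here, not settings prescribed by the paper.

```python
import numpy as np

def sample_fa(N, n, m, sigma2_e, lam=1.0, seed=0):
    """Draw N samples from x = U y + e with U^T U = I_m and mu = 0 (assumed)."""
    rng = np.random.default_rng(seed)
    # Orthonormal columns from the reduced QR factorization of a Gaussian matrix.
    U, _ = np.linalg.qr(rng.standard_normal((n, m)))
    y = np.sqrt(lam) * rng.standard_normal((N, m))        # y ~ G(0, lam * I_m)
    e = np.sqrt(sigma2_e) * rng.standard_normal((N, n))   # e ~ G(0, sigma2_e * I_n)
    return y @ U.T + e, U

# Hypothetical setting: n = 15 observed dimensions, m = 5 hidden factors.
X, U = sample_fa(N=2000, n=15, m=5, sigma2_e=0.25)
S = np.cov(X, rowvar=False, bias=True)             # sample covariance (1/N)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]     # descending eigenvalues
print(np.round(eigvals, 2))
```

In this setting the population eigenvalues are $\lambda + \sigma_e^2 = 1.25$ for the first five directions and $\sigma_e^2 = 0.25$ for the rest, so with enough samples a gap appears after the fifth sample eigenvalue. Determining $m$ amounts to locating that gap, which becomes hard when the SNR or $N$ is small.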

3. MODEL SELECTION CRITERIA

The two-stage procedure performs parameter learning over a set of candidate models among which one is selected by a model selection criterion. Typical examples include the classical AIC [4], BIC/MDL [7, 8], CAIC [5], HQC[6], and recent Minka’s criterion (MK) [9] and VB [10], and BYY [16, 17], as well as the difference of negative log-likelihood (DNLL). They are briefly summarized in Tab.1.

| Criterion | Stage-I | Stage-II: $\hat{m} = \arg\min_m J(m)$ |
|---|---|---|
| DNLL | EM alg. | $J(m) = -L(\hat{\Theta}_m^{ML}) + L(\hat{\Theta}_{m-1}^{ML})$ |
| AIC | EM alg. | $J(m) = -L(\hat{\Theta}_m^{ML}) + d_m$ |
| BIC | EM alg. | $J(m) = -L(\hat{\Theta}_m^{ML}) + \frac{d_m}{2} \ln N$ |
| CAIC | EM alg. | $J(m) = -L(\hat{\Theta}_m^{ML}) + \frac{d_m}{2} (\ln N + 1)$ |
| HQC | EM alg. | $J(m) = -L(\hat{\Theta}_m^{ML}) + d_m \ln(\ln N)$ |
| MK | eig | $J(m)$ by equation (30) in [9] |
| VB | VBEM | $J(m) = -F(\hat{p}_U, \hat{p}_\nu, \hat{p}_\varphi, \hat{p}_Y, m)$ |
| BYY | eq.(2) | $J(m) = -H(p \Vert q, \Theta_m, \Xi) + \frac{1}{2} d_m$ |

Table 1. The two-stage procedures for several criteria, where $L(\hat{\Theta}_m^{ML}) = \max_{\Theta_m} \ln q(X_N|\Theta_m)$ is the maximized log-likelihood of the data set $X_N$, and $d_m = nm + 1 - \frac{m(m-1)}{2}$ is the number of free parameters in FA. In Stage-I, “EM alg.” denotes the Expectation Maximization (EM) algorithm for FA; “eig” means estimating the sample eigenvalues for MK; $F(\hat{p}_U, \hat{p}_\nu, \hat{p}_\varphi, \hat{p}_Y, m)$ is the variational lower bound resulting from VBEM; and $H(p \Vert q, \Theta_m, \Xi)$ is the harmony functional resulting from implementing eq.(2).
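The penalized-likelihood rows of Tab.1 (AIC, BIC, CAIC, HQC) can be sketched as follows. Note the substitution made here: instead of running the EM algorithm, Stage-I uses the closed-form maximum likelihood solution in terms of the sample eigenvalues known from [3], which should agree with EM at its global optimum. The function names `ml_loglik` and `select_m` and the demo data are hypothetical.

```python
import numpy as np

def ml_loglik(eigvals, N, m):
    """Maximized log-likelihood L(Theta_m^ML) from sorted sample eigenvalues."""
    n = len(eigvals)
    sigma2 = eigvals[m:].mean()  # ML noise variance: mean of the n - m smallest eigenvalues
    return -0.5 * N * (n * np.log(2 * np.pi) + np.log(eigvals[:m]).sum()
                       + (n - m) * np.log(sigma2) + n)

# Penalty terms of Tab.1 as functions of d_m and N.
PENALTY = {
    "AIC":  lambda d, N: d,
    "BIC":  lambda d, N: 0.5 * d * np.log(N),
    "CAIC": lambda d, N: 0.5 * d * (np.log(N) + 1.0),
    "HQC":  lambda d, N: d * np.log(np.log(N)),
}

def select_m(X, candidates, criterion="BIC"):
    """Return (m_hat, scores) with m_hat = argmin_m J(m) over the candidates."""
    N, n = X.shape
    S = np.cov(X, rowvar=False, bias=True)            # ML (1/N) sample covariance
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]    # descending order
    scores = {}
    for m in candidates:
        d_m = n * m + 1 - m * (m - 1) / 2             # free parameters, as in Tab.1
        scores[m] = -ml_loglik(eigvals, N, m) + PENALTY[criterion](d_m, N)
    return min(scores, key=scores.get), scores

# Hypothetical demo data: n = 15, m = 5, lambda = 1, sigma_e^2 = 0.25, N = 400.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((15, 5)))
X = rng.standard_normal((400, 5)) @ U.T + 0.5 * rng.standard_normal((400, 15))
m_hat, scores = select_m(X, range(1, 10), "BIC")
print("BIC selects m =", m_hat)
```

The four criteria share the same fitted likelihood and differ only in how strongly $d_m$ is penalized, which is why their behavior diverges mainly when $N$ is small.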

One difficulty in Bayesian model selection is computing the marginal likelihood, which incorporates priors on the parameters and involves a high-dimensional integration. To approximate the marginal likelihood, Minka [9] proposed a criterion (MK) via the Laplace approximation. Variational Bayes (VB) [10] is another way to approximate the (log) marginal likelihood, bounding it from below by means of variational methods. Since no VB algorithm exists for FA as given by eq.(1), we derive one in this paper by adopting, for $U$, the uniform prior over the Stiefel manifold used in [9], i.e.,

$$q(U) = 2^{-m} \prod_{i=1}^{m} \Gamma\!\left(\tfrac{n-i+1}{2}\right) \pi^{-\frac{n-i+1}{2}},$$

and the commonly used Gamma densities as priors for the precision parameters, i.e., $q(\nu|a_\nu, b_\nu) = \prod_{i=1}^{m} \Gamma(\nu_i|a_{\nu_i}, b_{\nu_i})$ and $q(\varphi|a_\varphi, b_\varphi) = \Gamma(\varphi|a_\varphi, b_\varphi)$, with $\nu = \Lambda^{-1}$ and $\varphi = (\sigma_e^2)^{-1}$. Then, the VBEM algorithm implements $\max_{\{p_U, p_\nu, p_\varphi, p_Y\}} F$ for each candidate scale $m$, where $F$ is the variational lower bound

$$F = \int p_U\, p_\nu\, p_\varphi\, p_Y \ln \frac{q(X_N, Y|\Theta)\, q(\Theta|\Xi)}{p_U\, p_\nu\, p_\varphi\, p_Y}\, dY\, dU\, d\nu\, d\varphi,$$

in which the posterior is constrained to the factorized form $p_U p_\nu p_\varphi p_Y$, $q(X_N, Y|\Theta) = \prod_t q(x_t|y_t)\, q(y_t)$ is given by eq.(1), and $q(\Theta|\Xi) = q(U)\, q(\nu|a_\nu, b_\nu)\, q(\varphi|a_\varphi, b_\varphi)$.
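A one-line identity may help explain why maximizing $F$ approximates the log marginal likelihood. It is a standard property of variational bounds rather than a result stated in this paper; $\mathrm{KL}$ denotes the Kullback-Leibler divergence and $q(Y, \Theta|X_N, \Xi)$ the exact posterior, both introduced here only for illustration:

```latex
\begin{align*}
\ln q(X_N|\Xi)
  &= \ln \int q(X_N, Y|\Theta)\, q(\Theta|\Xi)\, dY\, d\Theta \\
  &= F + \mathrm{KL}\big(p_U\, p_\nu\, p_\varphi\, p_Y \,\big\Vert\, q(Y, \Theta|X_N, \Xi)\big)
  \;\ge\; F .
\end{align*}
```

Since the divergence is nonnegative, raising $F$ tightens the bound, and the value of $-F$ at convergence is what serves as $J(m)$ for VB in Tab.1.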

First proposed in [11] and systematically developed over more than a decade [16, 17], Bayesian Ying-Yang (BYY) harmony learning theory is a general statistical learning framework that handles both parameter learning and model selection under a best harmony principle. The general two-stage procedure of BYY harmony learning given in [16, 17] is summarized in Tab.1, where Stage-I implements

$$
\begin{aligned}
\text{I(a)}:\ & \Theta_m^{(\tau)} = \arg\max_{\Delta\Theta_m} H(p \Vert q, \Theta_m, \Xi^{(\tau-1)}), \\
\text{I(b)}:\ & \Xi^{(\tau)} = \arg\max_{\Delta\Xi} \{ H(p \Vert q, \Theta_m^{(\tau)}, \Xi) + \tfrac{1}{2} d(\Xi) \}, \\
& d(\Xi) = -d_m + (\Theta_m^{(\tau-1)} - \Theta_m^{(\tau)})^T\, \Omega\, (\Theta_m^{(\tau-1)} - \Theta_m^{(\tau)}),
\end{aligned}
\qquad (2)
$$

where $\Omega = \nabla^2_{\Theta\Theta^T} H(p \Vert q, \Theta_m^{(\tau)}, \Xi)$.

| band (%) | very good | not good |
|---|---|---|
| 80 ∼ 100 | most criteria | DNLL, AIC |
| 40 ∼ 80 | BYY, VB | BIC, CAIC, DNLL |
| 0 ∼ 40 | BYY, VB | the rest criteria |
| low SNR {1.5, 2.0} | VB {7 red ∗}; BYY {8 red ∗ = 1 (BYYo) + 3 (BYYp) + 4 (BYYph)} | the rest criteria get fewer than 3 red ∗ |
| small N (<= 75) | BYYph gets most red | the rest criteria get few red |

Table 2. The comparisons are based on (1) the band area between the specified contour lines (%) (the bigger and the closer to the left corner, the better), and (2) the number of red asterisks (∗) (the more, the better).

Specifically, the harmony functional for FA by eq.(1) is $H(p \Vert q, \Theta_m, \Xi) =$

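$$\sum_t \ln G(x_t|0, \Sigma_x) + N \ln (2\pi e)^{n}|\Sigma_{y|x}| + d_r(\widetilde{W}) + \ln q(\Theta|\Xi), \qquad (3)$$

where $d_r(\widetilde{W}) = -\mathrm{Tr}[\Delta_W^T (\Sigma_{y|x})^{-1} \Delta_W S_N]$ will vanish as the difference $\Delta_W = \widetilde{W} - W$ converges to zero, i.e., as the free parameter $\widetilde{W}$ converges to $W = \Lambda U^T \Sigma_x^{-1}$, with $\Sigma_{y|x} = \Lambda^{-1} + U^T \Sigma_e^{-1} U$ and $\Sigma_x$ given in eq.(1).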

Ignoring priors by letting $q(\Theta|\Xi) = 1$, BYY (denoted as “BYYo”) still possesses a good model selection ability [16]. By eq.(2), we further implement BYY (denoted as “BYYp”) by adopting the same priors as used in VB, so that $\ln q(\Theta|\Xi)$ in eq.(3) plays the role of regularization. By I(b) in eq.(2), the hyperparameters are updated in BYY (named “BYYph”) to further increase the harmony functional. All BYY algorithms are implemented by the gradient method.

4. EMPIRICAL ANALYSIS

The empirical analysis is based on a series of controlled experiments that vary the sample size $N \in \{25, 50, 75, 100, 200, 400, 800\}$ and the SNR $\gamma_o = \frac{\lambda_{m^*}}{\sigma_e^2} + 1 \in \{1.2, 1.5, 2, 2.5, 3, 3.5, 4, 8, 16\}$ over wide ranges, where $\lambda_1 = \ldots = \lambda_m = 1$, and $n$, $m$ (i.e., $\dim(x)$, $\dim(y)$) are fixed at 15 and 5, respectively, due to the space limit. For each of $10^2$ independent runs, the two-stage procedure for every criterion is carried out on the set of candidate models $M = \{1, \ldots, 9\}$, based on a data set $X_N$ randomly generated according to a chosen setting of $N$ and $\gamma_o$. We report the percentages of successful selections, i.e., $\hat{m} = m^*$, in the form of contour maps in Fig.1.
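The experimental protocol can be sketched for a single criterion on a coarse grid. This is a rough illustration under stated assumptions, not the authors' code: only BIC is shown, the closed-form eigenvalue solution replaces the EM algorithm, the mean is taken as zero, and the helper names `bic_select` and `success_rate` are hypothetical. The noise variance follows from the SNR definition with $\lambda = 1$, i.e., $\sigma_e^2 = 1/(\gamma_o - 1)$.

```python
import numpy as np

def bic_select(X, candidates):
    """Pick m minimizing J(m) = -L_m + (d_m / 2) ln N via the closed-form ML fit."""
    N, n = X.shape
    ev = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True)))[::-1]
    best_m, best_J = None, np.inf
    for m in candidates:
        sigma2 = ev[m:].mean()                        # ML noise variance
        loglik = -0.5 * N * (n * np.log(2 * np.pi) + np.log(ev[:m]).sum()
                             + (n - m) * np.log(sigma2) + n)
        d_m = n * m + 1 - m * (m - 1) / 2             # free parameters, as in Tab.1
        J = -loglik + 0.5 * d_m * np.log(N)
        if J < best_J:
            best_m, best_J = m, J
    return best_m

def success_rate(N, gamma_o, n=15, m_true=5, runs=100, seed=0):
    """Percentage of runs in which the selected m equals the true m."""
    rng = np.random.default_rng(seed)
    sigma2_e = 1.0 / (gamma_o - 1.0)   # from gamma_o = lambda / sigma_e^2 + 1, lambda = 1
    hits = 0
    for _ in range(runs):
        U, _ = np.linalg.qr(rng.standard_normal((n, m_true)))
        X = (rng.standard_normal((N, m_true)) @ U.T
             + np.sqrt(sigma2_e) * rng.standard_normal((N, n)))
        hits += int(bic_select(X, range(1, 10)) == m_true)
    return 100.0 * hits / runs

# Coarse sub-grid of the settings used in the paper.
for N in (25, 100, 800):
    print(N, [success_rate(N, g) for g in (1.5, 3.0, 16.0)])
```

Evaluating `success_rate` on the full grid of $N$ and $\gamma_o$ and drawing iso-accuracy lines over the resulting matrix gives a contour map of the kind reported for each criterion.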

The contour maps define a family of model selection indifference curves that visualize the performance over the space of SNR and $N$. The performance degrades as $N$ and SNR decrease. It can also be observed from Fig.1 that there is (1) a three-region partition, i.e., all criteria perform well when SNR and $N$ are large and badly when they are too small, while the region with moderate SNR and $N$ differentiates the criteria well; and (2) a diminishing marginal effect, i.e., the amount of SNR (or $N$) needed to trade for a unit loss of $N$ (or SNR) increases when moving down an indifference curve.

Detailed observations from Fig.1 and Fig.2 are listed in Tab.2. VB and the three variants of BYY are relatively more robust than the other criteria against reductions in the sample size and SNR, and BYYph performs the best in general.

All methods are also evaluated on the real-world dataset Pendigits¹ (16 attributes, 10 classes, 10992 instances). Similarly, we vary the training sample size $N$. The classification results, listed in Tab.3, basically coincide with the model selection performance on synthetic data.

| % | N = 16 | N = 30 | N = 100 |
|---|---|---|---|
| AIC | 56.46 ± 6.20 | 88.91 ± 1.72 | 96.61 ± 0.39 |
| BIC | 56.46 ± 6.19 | 87.72 ± 1.92 | 96.64 ± 0.40 |
| HQC | 56.46 ± 6.20 | 88.51 ± 1.59 | 96.62 ± 0.31 |
| CAIC | 48.34 ± 12.7 | 87.63 ± 1.62 | 96.63 ± 0.33 |
| DNLL | 79.18 ± 9.01 | 86.48 ± 1.71 | 91.26 ± 0.58 |
| VB | 87.02 ± 2.45 | 93.86 ± 1.31 | 96.19 ± 0.31 |
| BYYph | 88.57 ± 1.04 | 94.15 ± 0.33 | 96.29 ± 0.16 |

Table 3. Classification accuracies (%), mean ± stdev over $10^2$ runs.

5. CONCLUSION

Based on the problem of determining the number of underlying source signals, we have investigated the relative strengths and weaknesses of not only the classical AIC, BIC/MDL, CAIC and HQC, but also the more recently developed Minka’s criterion, VB and BYY. We derive a new VB algorithm for FA by imposing appropriate priors, which are also adopted in BYY for further implementations. The investigation is made via a new empirical analysis tool featuring model selection indifference curves, which reveal a three-region partition and a diminishing marginal effect. Moreover, the BYY variant that updates the priors’ hyperparameters is the best in general.

6. REFERENCES

[1] M. Wax and T. Kailath, “Detection of signals by information theoretic criteria,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-33, no. 2, pp. 387, 1985.

[2] T.W. Anderson and H. Rubin, “Statistical inference in factor analysis,” in Proc. of third Berkeley symposium on mathematical statistics and probability, 1956, vol. 5, pp. 111–150.

[3] Michael E. Tipping and Christopher M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.

[4] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, Dec 1974.

[5] Hamparsum Bozdogan, “Model selection and Akaike’s Information Criterion (AIC): The general theory and its analytical extensions,” Psychometrika, vol. 52, no. 3, pp. 345–370, 1987.

[6] E. J. Hannan, A. J. McDougall, and D. S. Poskitt, “The determination of the order of an autoregression,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 51, pp. 217–233, 1989.

[7] Gideon Schwarz, “Estimating the Dimension of a Model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

¹ From the UCI repository: http://archive.ics.uci.edu/ml/datasets.html

[Figure 1: nine contour-map panels titled “Successful-Selection rates by [AIC]”, “[BIC]”, “[CAIC]”, “[HQC]”, “[MK]”, “[VB]”, “[BYYo]”, “[BYYp]” and “[BYYph]”. Axes: SNR (adjusted), with ticks 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 8, 16; Sample Size N (adjusted), with ticks 25, 50, 75, 100, 200, 400, 800. Contour lines are labeled in percent.]

Fig. 1. Contour maps of the successful selection rates of all criteria, drawn against adjusted axes (i.e., equal spacing among the setting values). A red asterisk (∗) indicates that the corresponding criterion attains the highest successful selection rate at that setting.

[8] J. Rissanen, “Modelling by the shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.

[9] Thomas P. Minka, “Automatic choice of dimensionality for PCA,” in Advances in Neural Information Processing Systems 13, 2001, pp. 598–604.

[10] Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[11] Lei Xu, “Bayesian-Kullback coupled Ying-Yang machines: Unified learnings and new results on vector quantization,” in International Conference on Neural Information Processing (ICONIP), 1995, pp. 977–988.

[12] E. Fishler, M. Grosmann, and H. Messer, “Detection of signals by information theoretic criteria: general asymptotic performance analysis,” IEEE Transactions on Signal Processing, vol. 50, no. 5, pp. 1027–1036, May 2002.

[13] Shikui Tu and Lei Xu, “Theoretical analysis and comparison of several criteria on linear model dimension reduction,” in ICA ’09: Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation, Berlin, Heidelberg, 2009, pp. 154–162, Springer-Verlag.

[14] Lei Xu, “Bayesian Ying-Yang learning theory for data dimension reduction and determination,” Journal of Computational Intelligence in Finance, vol. 6, no. 5, pp. 6–18, 1998.

[15] Shikui Tu and Lei Xu, “On the two parameterizations of factor analysis: which one is better?,” (In preparation), 2009.

[16] Lei Xu, “Bayesian Ying Yang Learning,” in Scholarpedia 2(3):1809, http://scholarpedia.org/article/Bayesian_Ying_Yang_Learning, 2007.

[17] Lei Xu, “Bayesian Ying-Yang System, Best Harmony Learning, and Five Action Circling,” to appear in an invited special issue on Emerging Themes on Information Theory and Bayesian Approach, Frontiers of Electrical and Electronic Engineering in China, a journal jointly published by Higher Education Press of China and Springer, 2010.

[Figure 2: one contour-map panel titled “Successful-Selection rates by [DNLL]”. Axes: SNR (adjusted), with ticks 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 8, 16; Sample Size N (adjusted), with ticks 25, 50, 75, 100, 200, 400, 800. Contour lines are labeled in percent.]

Fig. 2. Continuation of Fig.1 for DNLL.
