2010 IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP 2010


March 14-19, 2010 • Dallas, Texas, U.S.A.


©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

IEEE Catalog Number: CFP10ICA-CDR ISBN: 978-1-4244-4296-6 ISSN: 1520-6149


A STUDY OF SEVERAL MODEL SELECTION CRITERIA FOR DETERMINING THE NUMBER OF SIGNALS

Shikui Tu and Lei Xu∗

Department of Computer Science and Engineering,

The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. {sktu,lxu}@cse.cuhk.edu.hk

ABSTRACT

Treating the detection of the number of source signals as the selection of the hidden dimensionality of the Factor Analysis (FA) model, we investigate several model selection criteria via a new empirical analysis tool that examines the joint effect of the signal-to-noise ratio (SNR) and the sample size N on model selection performance. The contours of the model selection accuracies visualize a three-region partition of the space of SNR and N, and a diminishing marginal effect in trading off SNR against N at a fixed performance level. Moreover, the newly derived Variational Bayes algorithm and three variants of the Bayesian Ying-Yang (BYY) algorithm are more robust against reductions in SNR and N, and the BYY variant that updates the priors' hyperparameters is the best in general.

Index Terms: Number of signals, hidden dimensionality, linear model, model selection, criteria

1. INTRODUCTION

Detecting the number of underlying source signals is an essential issue in many signal processing problems, such as sensor array processing, retrieval of the poles of a system response, direction-of-arrival estimation by a smart antenna system, and retrieval of overlapping echoes from radar backscatter (see e.g., [1]). The observed vector can be modeled as a superposition of a finite number of underlying source signals with additive noise. The source signals and the noise vector sequence are assumed to be two independent ergodic zero-mean Gaussian random processes. Moreover, this issue can also be addressed as a model selection problem, namely selecting the hidden dimensionality of Factor Analysis (FA) [2] in its special case of Principal Component Analysis (PCA) [3] under the Maximum Likelihood principle.

A classical approach to model selection is the two-stage procedure, i.e., parameter learning is carried out on a set of candidate models, among which one is selected by a model selection criterion. The existing criteria include Akaike's Information Criterion (AIC) [4], Bozdogan's Consistent Akaike's Information Criterion (CAIC) [5], the Hannan-Quinn information criterion (HQC) [6], Schwarz's Bayesian Information Criterion (BIC) [7] (which coincides with Rissanen's Minimum Description Length (MDL) [8]), and the more recently developed Minka's criterion (MK) [9], Variational Bayes (VB) [10], and the Bayesian Ying-Yang (BYY) harmony learning criterion [11].

∗ Corresponding author: Lei Xu. Email: [email protected]. The work described in this paper was fully supported by a grant from the Research Grant Council of the Hong Kong SAR (Project No: CUHK4177/07E).

Since [1], where AIC and MDL were introduced to determine the number of signals, many studies have followed, e.g., [12], with emphasis on the asymptotic properties of the criteria under certain assumptions. Along this line, this paper presents a systematic investigation of those criteria via a new empirical analysis tool that examines the joint effect of the signal-to-noise ratio (SNR) and the sample size N, rather than the effect of either SNR or N with the other fixed, as in previous work, e.g., [13].

The adopted parameterization of FA differs from the common one in, e.g., [10]. The two forms are equivalent in Maximum Likelihood learning but differ in model selection, as pointed out in [14] under BYY. In fact, the adopted form results in a better model selection ability under BYY and VB, with details given in another working paper [15]. Here, we implement the EM algorithm [3] for the classical criteria such as AIC. For VB, we derive a new VBEM algorithm by imposing appropriate priors on the unknown parameters. In addition to the existing prior-free version (BYYo) [16], BYY is further implemented not only by adopting the same priors as the VBEM (BYYp) but also by updating the hyperparameters of the priors (BYYph) under the Hessian-based second-order information conservation principle [16].

By varying N and SNR over a wide range in the empirical analysis, we connect the settings with the same model selection accuracy into contours, which define a family of model selection performance indifference curves (a term borrowed from economics) for each criterion. We are then able to reveal a diminishing marginal effect, namely that the amount of SNR (or N) needed to trade for a unit of N (or SNR) increases if the performance is kept unchanged, and to present a three-region partition of the space of N and SNR: all methods perform well when SNR and N are large and badly when they are too small, while within the region of moderate SNR and N the performances of these methods clearly differ. Moreover, VB and the three variants of BYY outperform the others in the region of diversity, and BYYph is the best in general.

The rest of the paper is organized as follows. Section 2 formulates the problem of determining the number of signals as estimating the hidden dimensionality of FA. Section 3 introduces the two-stage procedure with several model selection criteria, whose behaviors are empirically analyzed in Section 4, followed by the concluding remarks in Section 5.

2. PROBLEM FORMULATION

In signal processing [1], a common model for the received complex-valued signal vector $x(t)$ from an array of $n$ sensors at time instant $t$ is $x(t) = As(t) + e(t)$, where $A$ is the steering matrix with full column rank. The $m$-dimensional source signal vector sequence $\{s(t)\}$ is assumed to be a stationary and ergodic Gaussian random process with zero mean and positive definite covariance matrix $\Sigma_s$. The noise sequence $\{e(t)\}$ is assumed to be a stationary and ergodic Gaussian vector process, independent of the source signals, with zero mean and isotropic covariance matrix $\sigma_e^2 I_n$, where $I_n$ is the $n \times n$ identity matrix. Determining the number of source signals from an observed sequence $\{x(t)\}_{t=1}^{N}$ amounts to estimating the rank of $A\Sigma_s A^H$ in $\Sigma_{x|c} = A\Sigma_s A^H + \sigma_e^2 I_n$, where $\Sigma_{x|c}$ is the population covariance matrix of the received data, and the superscript "H" denotes the complex conjugate transpose.
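The rank statement can be made concrete through the eigenvalues of the population covariance. The following worked step is a sketch under the assumptions already stated ($A$ of full column rank, $\Sigma_s$ positive definite) plus $m < n$; the symbol $\rho_i$ for the nonzero eigenvalues of $A\Sigma_s A^H$ is introduced here only for illustration and is not part of the original notation.

```latex
\operatorname{rank}(A\Sigma_s A^H) = m
\;\Longrightarrow\;
\operatorname{eig}(A\Sigma_s A^H)
  = \{\rho_1 \ge \dots \ge \rho_m > 0,\ \underbrace{0,\dots,0}_{n-m}\},
\\[4pt]
\operatorname{eig}(\Sigma_{x|c})
  = \{\rho_1 + \sigma_e^2,\ \dots,\ \rho_m + \sigma_e^2,\
      \underbrace{\sigma_e^2,\dots,\sigma_e^2}_{n-m}\}.
```

Under these assumptions the number of signals equals $n$ minus the multiplicity of the smallest eigenvalue of the population covariance. With a finite sample the eigenvalues of the sample covariance are distinct in general, which is why a model selection criterion is needed rather than a direct count.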

On the other hand, a model called Factor Analysis (FA) in machine learning [16, 3] and statistics [2] assumes that an observed real-valued $n$-dimensional random variable $x$ is generated as follows:

$$
\begin{cases}
x = Uy + \mu + e, \quad \Theta_m = \{U, \mu, \Lambda, \Sigma_e\};\\
q(x|y) = G(x|Uy + \mu, \Sigma_e), \quad q(y) = G(y|0, \Lambda),\\
q(x|\Theta_m) = \int q(x|y)\,q(y)\,dy = G(x|\mu, \Sigma_x),\\
\Lambda = \mathrm{diag}[\lambda_1, \dots, \lambda_m], \quad \Sigma_x = U\Lambda U^T + \Sigma_e,
\end{cases}
\tag{1}
$$

where $U^T U = I_m$, $\Sigma_x$ is the population covariance matrix of the data, and $\Sigma_e = \sigma_e^2 I_n$, which makes FA equivalent to PCA under the maximum likelihood principle [3]. Estimating the hidden dimensionality $m$ amounts to determining the rank of $U\Lambda U^T$ in $\Sigma_x = U\Lambda U^T + \sigma_e^2 I_n$ based on $\{x_t\}_{t=1}^{N}$.

The two rank estimation problems are equivalent in the sense that both aim to estimate the same rank m in two structurally similar covariance equations. In the following we focus on the latter, which is also widely used as a dimensionality reduction technique for feature extraction.
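As an illustration of the FA formulation, the following minimal sketch draws data from $x = Uy + \mu + e$ with $\mu = 0$ and compares the eigenvalues of the population covariance with those of the sample covariance. The function name `sample_fa`, the QR-based construction of an orthonormal $U$, and the particular values of $N$ and $\sigma_e^2$ are assumptions made here for illustration; they are not taken from the paper.

```python
import numpy as np

def sample_fa(N, n, lam, sigma2_e, rng):
    """Draw N samples from x = U y + e with U^T U = I_m and mu = 0 (assumed)."""
    m = len(lam)
    # Random n x m matrix with orthonormal columns via a reduced QR decomposition.
    U, _ = np.linalg.qr(rng.standard_normal((n, m)))
    Y = rng.standard_normal((N, m)) * np.sqrt(lam)       # y ~ G(0, Lambda)
    E = rng.standard_normal((N, n)) * np.sqrt(sigma2_e)  # e ~ G(0, sigma_e^2 I_n)
    return Y @ U.T + E, U

rng = np.random.default_rng(0)
n, m_true, sigma2_e = 15, 5, 0.1
lam = np.ones(m_true)
X, U = sample_fa(N=20000, n=n, lam=lam, sigma2_e=sigma2_e, rng=rng)

# Population covariance Sigma_x = U Lambda U^T + sigma_e^2 I_n, eigenvalues largest first.
Sigma_x = U @ np.diag(lam) @ U.T + sigma2_e * np.eye(n)
pop_eigs = np.sort(np.linalg.eigvalsh(Sigma_x))[::-1]

# Eigenvalues of the sample covariance, largest first.
S = np.cov(X, rowvar=False)
sample_eigs = np.sort(np.linalg.eigvalsh(S))[::-1]

print(np.round(pop_eigs, 3))
print(np.round(sample_eigs, 3))
```

The population spectrum has exactly $m$ eigenvalues above the noise floor $\sigma_e^2$, while the sample spectrum only approximates this step, and the gap blurs as $N$ or the SNR decreases.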

3. MODEL SELECTION CRITERIA

The two-stage procedure performs parameter learning over a set of candidate models, among which one is selected by a model selection criterion. Typical examples include the classical AIC [4], BIC/MDL [7, 8], CAIC [5], and HQC [6], the more recent Minka's criterion (MK) [9], VB [10], and BYY [16, 17], as well as the difference of negative log-likelihood (DNLL). They are briefly summarized in Tab. 1.

| criteria | Stage-I | Stage-II: $\hat{m} = \arg\min_m J(m)$ |
|---|---|---|
| DNLL | EM alg. | $J(m) = -L(\hat{\Theta}^{ML}_m) + L(\hat{\Theta}^{ML}_{m-1})$ |
| AIC | EM alg. | $J(m) = -L(\hat{\Theta}^{ML}_m) + d_m$ |
| BIC | EM alg. | $J(m) = -L(\hat{\Theta}^{ML}_m) + \frac{d_m}{2}\ln N$ |
| CAIC | EM alg. | $J(m) = -L(\hat{\Theta}^{ML}_m) + \frac{d_m}{2}(\ln N + 1)$ |
| HQC | EM alg. | $J(m) = -L(\hat{\Theta}^{ML}_m) + d_m \ln(\ln N)$ |
| MK | eig | $J(m)$ by equation (30) in [9] |
| VB | VBEM | $J(m) = -F(\hat{p}_U, \hat{p}_\nu, \hat{p}_\phi, \hat{p}_Y, m)$ |
| BYY | eq. (2) | $J(m) = -H(p \Vert q, \Theta^*_m, \Xi^*) + \frac{1}{2} d_m$ |

Table 1. The two-stage procedures for several criteria, where $L(\hat{\Theta}^{ML}_m) = \max_{\Theta_m} \ln q(X_N|\Theta_m)$ is the maximized log-likelihood of the data set $X_N$, and $d_m = nm + 1 - \frac{m(m-1)}{2}$ is the number of free parameters in FA. In Stage-I, "EM alg." denotes the Expectation Maximization (EM) algorithm for FA; "eig" means estimating the sample eigenvalues for MK; $F(\hat{p}_U, \hat{p}_\nu, \hat{p}_\phi, \hat{p}_Y, m)$ is the variational lower bound resulting from VBEM; and $H(p \Vert q, \Theta^*_m, \Xi^*)$ is the harmony functional resulting from implementing eq. (2).
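The likelihood-based rows of Tab. 1 can be sketched in a few lines. The paper runs EM in Stage-I; the sketch below instead substitutes the well-known closed-form maximum of the PCA-type likelihood in terms of sample eigenvalues (the noise variance estimate is the mean of the discarded eigenvalues), which is an assumption made here for brevity. The helper names `ml_loglik` and `select_m`, the data-generating settings, and the requirement that the candidates be consecutive integers are likewise illustrative.

```python
import numpy as np

def ml_loglik(eigs, m, N):
    """Maximized log-likelihood for hidden dimension m, given the descending
    eigenvalues of the 1/N-normalized sample covariance (closed form, not EM)."""
    n = len(eigs)
    sigma2 = eigs[m:].mean()  # ML noise variance: mean of the n - m smallest eigenvalues
    return -0.5 * N * (n * np.log(2 * np.pi) + np.log(eigs[:m]).sum()
                       + (n - m) * np.log(sigma2) + n)

def select_m(X, candidates):
    """Return {criterion: selected m}; candidates must be consecutive integers >= 1."""
    N, n = X.shape
    S = np.cov(X, rowvar=False, bias=True)
    eigs = np.sort(np.linalg.eigvalsh(S))[::-1]
    scores = {name: {} for name in ("DNLL", "AIC", "BIC", "CAIC", "HQC")}
    for m in candidates:
        nll = -ml_loglik(eigs, m, N)
        d_m = n * m + 1 - m * (m - 1) / 2  # number of free parameters, as in Tab. 1
        scores["DNLL"][m] = nll + ml_loglik(eigs, m - 1, N)
        scores["AIC"][m] = nll + d_m
        scores["BIC"][m] = nll + 0.5 * d_m * np.log(N)
        scores["CAIC"][m] = nll + 0.5 * d_m * (np.log(N) + 1)
        scores["HQC"][m] = nll + d_m * np.log(np.log(N))
    # Stage-II: pick the candidate that minimizes J(m) for each criterion.
    return {name: min(J, key=J.get) for name, J in scores.items()}

# Illustrative data: n = 15, m* = 5, unit factor variances, large N and high SNR.
rng = np.random.default_rng(0)
n, m_true, N, sigma2_e = 15, 5, 800, 1.0 / 15.0
U, _ = np.linalg.qr(rng.standard_normal((n, m_true)))
X = (rng.standard_normal((N, m_true)) @ U.T
     + np.sqrt(sigma2_e) * rng.standard_normal((N, n)))
selected = select_m(X, range(1, 10))
print(selected)
```

The criteria share the same likelihood term and differ only in the penalty, so their disagreement at small $N$ or low SNR comes entirely from how strongly each penalizes $d_m$.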

One difficulty in Bayesian model selection is computing the marginal likelihood, which incorporates priors on the parameters and involves a high-dimensional integration. To approximate the marginal likelihood, Minka [9] proposed a criterion (MK) via the Laplace approximation. Variational Bayes (VB) [10] is another way to approximate the (log) marginal likelihood, by a lower bound obtained with variational methods. Since no VB algorithm exists for FA in the form of eq. (1), we derive one in this paper by adopting the uniform prior over the Stiefel manifold used in [9] for $U$, i.e.,

$$
q(U) = 2^{-m} \prod_{i=1}^{m} \Gamma\!\left(\tfrac{n-i+1}{2}\right) \pi^{-\frac{n-i+1}{2}},
$$

and the commonly used Gamma densities as priors for the precision parameters, i.e.,

$$
q(\nu|a^{\nu}, b^{\nu}) = \prod_{i=1}^{m} \Gamma(\nu_i|a^{\nu}_i, b^{\nu}_i), \qquad q(\phi|a^{\phi}, b^{\phi}) = \Gamma(\phi|a^{\phi}, b^{\phi}),
$$

with $\nu = \Lambda^{-1}$ and $\phi = (\sigma_e^2)^{-1}$. Then the VBEM algorithm implements $\max_{p_U, p_\nu, p_\phi, p_Y} F$ for each candidate scale $m$, where $F$ is the variational lower bound

$$
F = \int p_U\, p_\nu\, p_\phi\, p_Y \ln \frac{q(X_N, Y|\Theta)\, q(\Theta|\Xi)}{p_U\, p_\nu\, p_\phi\, p_Y}\; dY\, dU\, d\nu\, d\phi,
$$

in which the posterior is constrained to the factorized form $p_U\, p_\nu\, p_\phi\, p_Y$, $q(X_N, Y|\Theta) = \prod_t q(x_t|y_t)\, q(y_t)$ is given by eq. (1), and $q(\Theta|\Xi) = q(U)\, q(\nu|a^{\nu}, b^{\nu})\, q(\phi|a^{\phi}, b^{\phi})$.
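The two prior terms entering $\ln q(\Theta|\Xi)$ are easy to evaluate in log form. The sketch below is only an illustration: the function names are hypothetical, and the Gamma density is written in the shape-rate parameterization, which is an assumption since the text does not state which parameterization is used.

```python
import math

def log_uniform_stiefel_prior(n, m):
    """ln q(U) = -m ln 2 + sum_{i=1}^{m} [ln Gamma((n-i+1)/2) - ((n-i+1)/2) ln pi]."""
    total = -m * math.log(2.0)
    for i in range(1, m + 1):
        half = (n - i + 1) / 2.0
        total += math.lgamma(half) - half * math.log(math.pi)
    return total

def log_gamma_density(x, a, b):
    """ln Gamma(x | a, b) with shape a and rate b (assumed parameterization)."""
    return a * math.log(b) - math.lgamma(a) + (a - 1.0) * math.log(x) - b * x

# The prior on U is constant in U, so it depends only on the dimensions (n, m).
print(log_uniform_stiefel_prior(15, 5))
print(log_gamma_density(2.0, 1.5, 0.5))
```

Because $q(U)$ does not depend on $U$ itself, it acts as a per-dimension constant that still varies with the candidate $m$, so it influences the comparison between candidate models even though it does not affect the estimate of $U$.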

First proposed in [11] and systematically developed over more than a decade [16, 17], Bayesian Ying-Yang (BYY) harmony learning theory is a general statistical learning framework that can handle both parameter learning and model selection under a best harmony principle. Following [16, 17], the general two-stage procedure of BYY harmony learning is summarized in Tab. 1, where Stage-I implements

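$$
\begin{aligned}
\text{I(a):}\quad & \Theta^{(\tau)}_m = \arg\max/\mathrm{incr}_{\Theta_m}\, H(p\|q, \Theta_m, \Xi^{(\tau-1)}),\\
\text{I(b):}\quad & \Xi^{(\tau)} = \arg\max/\mathrm{incr}_{\Xi}\, \{H(p\|q, \Theta^{(\tau)}_m, \Xi) + \tfrac{1}{2} d(\Xi)\},\\
& d(\Xi) = -d_m + (\Theta^{(\tau-1)}_m - \Theta^{(\tau)}_m)\, \Omega\, (\Theta^{(\tau-1)}_m - \Theta^{(\tau)}_m),
\end{aligned}
\tag{2}
$$

where $\Omega = \nabla^2_{\Theta\Theta^T} H(p\|q, \Theta^{(\tau)}_m, \Xi)$. Specifically, the harmony functional for FA by eq. (1) is

$$
H(p\|q, \Theta_m, \Xi) = \sum_t \ln G(x_t|0, \Sigma_x) + N \ln\big[(2\pi e)^{n}\, |\Sigma_{y|x}|\big] + d_r(W) + \ln q(\Theta|\Xi),
\tag{3}
$$

where $d_r(W) = -\mathrm{Tr}[\Delta_W^T (\Sigma_{y|x})^{-1} \Delta_W S_N]$ vanishes as the difference $\Delta_W = W - \hat{W}$ converges to zero, i.e., as the free parameter $W$ converges to $\hat{W} = \Lambda U^T \Sigma_x^{-1}$, with $\Sigma_{y|x} = \Lambda^{-1} + U^T \Sigma_e^{-1} U$ and $\Sigma_x$ given in eq. (1).

| band (%) | very good | not good |
|---|---|---|
| 80 ∼ 100 | most criteria | DNLL, AIC |
| 40 ∼ 80 | BYY, VB | BIC, CAIC, DNLL |
| 0 ∼ 40 | BYY, VB | the rest criteria |
| low SNR {1.5, 2.0} | VB {7 red ∗}; BYY {8 red ∗ = 1 (BYYo) + 3 (BYYp) + 4 (BYYph)} | the rest criteria get fewer than 3 red ∗ |
| small N <= 75 | BYYph gets most red ∗ | the rest criteria get few red ∗ |

Table 2. The comparisons are based on (1) the band area between the specified contour lines (%) (the bigger and the closer to the left corner, the better), and (2) the number of red asterisks (∗) (the more, the better).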

Ignoring priors by letting $q(\Theta|\Xi) = 1$, BYY (denoted as "BYYo") still possesses a good model selection ability [16]. By eq. (2), we further implement BYY (denoted as "BYYp") by adopting the same priors as used in VB, so that $\ln q(\Theta|\Xi)$ in eq. (3) plays the role of regularization. By I(b) in eq. (2), the hyperparameters are updated in BYY (named "BYYph") to further increase the harmony functional. All BYY algorithms are implemented by the gradient method.

4. EMPIRICAL ANALYSIS

The empirical analysis is based on a series of controlled experiments that vary the sample size $N \in \{25, 50, 75, 100, 200, 400, 800\}$ and the SNR $\gamma_o = \lambda_{m^*}/\sigma_e^2 + 1 \in \{1.2, 1.5, 2, 2.5, 3, 3.5, 4, 8, 16\}$, where $\lambda_1 = \dots = \lambda_{m^*} = 1$, and $n$, $m^*$ (i.e., dim(x), dim(y)) are fixed at 15 and 5, respectively, due to the space limit. For each of $10^2$ independent runs, the two-stage procedure for every criterion is carried out on the set of candidate models $M = \{1, \dots, 9\}$ based on a data set $X_N$ randomly generated according to a chosen setting of $(N, \gamma_o)$. We report the percentages of successful selections, i.e., $\hat{m} = m^*$, in the form of contour maps in Fig. 1.
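The experimental grid can be sketched as a small Monte Carlo loop. The sketch below covers only BIC, uses the closed-form eigenvalue expression of the maximized likelihood in place of EM, draws a fresh orthonormal $U$ in every run, and uses 20 runs per setting rather than $10^2$ to keep it quick; all of these choices, as well as the function names, are assumptions for illustration and not the paper's implementation.

```python
import numpy as np

def bic_select(X, candidates):
    """Select m by BIC, J(m) = -L + (d_m / 2) ln N, from sample eigenvalues."""
    N, n = X.shape
    eigs = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True)))[::-1]
    best_m, best_J = None, np.inf
    for m in candidates:
        sigma2 = eigs[m:].mean()
        loglik = -0.5 * N * (n * np.log(2 * np.pi) + np.log(eigs[:m]).sum()
                             + (n - m) * np.log(sigma2) + n)
        d_m = n * m + 1 - m * (m - 1) / 2
        J = -loglik + 0.5 * d_m * np.log(N)
        if J < best_J:
            best_m, best_J = m, J
    return best_m

def success_rate(N, gamma_o, n=15, m_true=5, runs=100, seed=0):
    """Fraction of runs in which the selected m equals m_true."""
    rng = np.random.default_rng(seed)
    # gamma_o = lambda / sigma_e^2 + 1 with lambda = 1, hence sigma_e^2 = 1 / (gamma_o - 1).
    sigma2_e = 1.0 / (gamma_o - 1.0)
    hits = 0
    for _run in range(runs):
        U, _r = np.linalg.qr(rng.standard_normal((n, m_true)))
        Y = rng.standard_normal((N, m_true))
        E = rng.standard_normal((N, n)) * np.sqrt(sigma2_e)
        X = Y @ U.T + E
        hits += int(bic_select(X, range(1, 10)) == m_true)
    return hits / runs

sizes = [25, 50, 75, 100, 200, 400, 800]
snrs = [1.2, 1.5, 2, 2.5, 3, 3.5, 4, 8, 16]
# rates[i, j]: success rate at sample size sizes[i] and SNR snrs[j].
rates = np.array([[success_rate(N, g, runs=20) for g in snrs] for N in sizes])
print(np.round(100 * rates).astype(int))
```

The resulting matrix is the kind of grid from which contour lines of equal success rate can be drawn over the $(N, \gamma_o)$ plane.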

The contour maps define a family of model selection indifference curves that visualize the performance over the space of SNR and N. The performance of every criterion decreases as N and SNR decrease. In addition, two features can be observed from Fig. 1: (1) a three-region partition, i.e., all criteria perform well when SNR and N are large and badly when they are too small, while the region of moderate SNR and N differentiates the criteria well; and (2) a diminishing marginal effect, i.e., the amount of SNR (or N) needed to trade for a unit loss of N (or SNR) increases when moving down an indifference curve.

Detailed observations from Figs. 1 and 2 are listed in Tab. 2. VB and the three variants of BYY are more robust than the other criteria against reductions in the sample size and SNR, and BYYph performs the best in general.

All methods are also evaluated on a real-world dataset, Pendigits¹ (16 attributes, 10 classes, 10992 instances). Similarly, we vary the training sample size N. The classification results basically coincide with the model selection performance on synthetic data.

| % | N = 16 | N = 30 | N = 100 |
|---|---|---|---|
| AIC | 56.46 ± 6.20 | 88.91 ± 1.72 | 96.61 ± 0.39 |
| BIC | 56.46 ± 6.19 | 87.72 ± 1.92 | 96.64 ± 0.40 |
| HQC | 56.46 ± 6.20 | 88.51 ± 1.59 | 96.62 ± 0.31 |
| CAIC | 48.34 ± 12.7 | 87.63 ± 1.62 | 96.63 ± 0.33 |
| DNLL | 79.18 ± 9.01 | 86.48 ± 1.71 | 91.26 ± 0.58 |
| VB | 87.02 ± 2.45 | 93.86 ± 1.31 | 96.19 ± 0.31 |
| BYYph | 88.57 ± 1.04 | 94.15 ± 0.33 | 96.29 ± 0.16 |

Table 3. Classification accuracies (mean ± stdev) of 10² runs.

5. CONCLUSION

For the problem of determining the number of underlying source signals, we have investigated the relative strengths and weaknesses of not only the classical AIC, BIC/MDL, CAIC, and HQC, but also the more recently developed Minka's criterion, VB, and BYY. We derive a new VB algorithm for FA by imposing appropriate priors, which are also adopted in further implementations of BYY. The investigation is made via a new empirical analysis tool featuring model selection indifference curves, which reveal a three-region partition and a diminishing marginal effect. Moreover, the BYY variant with the priors' hyperparameters updated is the best in general.

6. REFERENCES

[1] M. Wax and T. Kailath, “Detection of signals by information theoretic criteria,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-33, no. 2, pp. 387, 1985.

[2] T.W. Anderson and H. Rubin, “Statistical inference in factor analysis,” in Proc. of third Berkeley symposium on mathematical statistics and probability, 1956, vol. 5, pp. 111–150.

[3] Michael E. Tipping and Christopher M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.

[4] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, Dec 1974.

[5] Hamparsum Bozdogan, “Model selection and Akaike’s Information Criterion (AIC): The general theory and its analytical extensions,” Psychometrika, vol. 52, no. 3, pp. 345–370, 1987.

[6] E. J. Hannan, A. J. McDougall, and D. S. Poskitt, “The determination of the order of an autoregression,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 51, pp. 217–233, 1989.

[7] Gideon Schwarz, “Estimating the Dimension of a Model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

¹ From the UCI repository: http://archive.ics.uci.edu/ml/datasets.html

[Figure 1: nine contour-map panels titled "Successful-Selection rates by [AIC]", "[BIC]", "[CAIC]", "[HQC]", "[MK]", "[VB]", "[BYYo]", "[BYYp]" and "[BYYph]". Axes: SNR (adjusted), with ticks 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 8, 16; Sample Size N (adjusted), with ticks 25, 50, 75, 100, 200, 400, 800. Contour lines are labeled with success rates between 20 and 100 (%).]

Fig. 1. Contour maps of the successful selection rates of all criteria, drawn against adjusted axes (i.e., equal spacing among the setting values). A red asterisk (∗) indicates that the corresponding criterion gets the highest successful selection rate at that setting.

[8] J. Rissanen, “Modelling by the shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.

[9] Thomas P. Minka, “Automatic choice of dimensionality for PCA,” in Advances in Neural Information Processing Systems 13, 2001, pp. 598–604.

[10] Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[11] Lei Xu, “Bayesian-Kullback coupled Ying-Yang machines: Unified learnings and new results on vector quantization,” in International Conference on Neural Information Processing (ICONIP), 1995, pp. 977–988.

[12] E. Fishler, M. Grosmann, and H. Messer, “Detection of signals by information theoretic criteria: general asymptotic performance analysis,” IEEE Transactions on Signal Processing, vol. 50, no. 5, pp. 1027–1036, May 2002.

[13] Shikui Tu and Lei Xu, “Theoretical analysis and comparison of several criteria on linear model dimension reduction,” in ICA ’09: Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation, Berlin, Heidelberg, 2009, pp. 154–162, Springer-Verlag.

[14] Lei Xu, “Bayesian Ying-Yang learning theory for data dimension reduction and determination,” Journal of Computational Intelligence in Finance, vol. 6, no. 5, pp. 6–18, 1998.

[15] Shikui Tu and Lei Xu, “On the two parameterizations of factor analysis: which one is better?,” (In preparation), 2009.

[16] Lei Xu, “Bayesian Ying Yang Learning,” in Scholarpedia 2(3):1809, http://scholarpedia.org/article/Bayesian Ying Yang Learning, 2007.

[17] Lei Xu, “Bayesian Ying-Yang System, Best Harmony Learning, and Five Action Circling,” to appear in an invited special issue on Emerging Themes on Information Theory and Bayesian Approach, Frontiers of Electrical and Electronic Engineering in China, a journal jointly published by Higher Education Press of China and Springer, 2010.

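[Figure 2: one contour-map panel titled "Successful-Selection rates by [DNLL]". Axes: SNR (adjusted), with ticks 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 8, 16; sample size ticks 25, 50, 75, 100, 200, 400, 800. Contour lines are labeled with success rates between 20 and 100 (%).]

Fig. 2. Continuation of Fig. 1 for DNLL.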