2010 IEEE International
Conference on Acoustics, Speech, and Signal Processing
ICASSP 2010
March 14-19, 2010 • Dallas, Texas, U.S.A.
©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
IEEE Catalog Number: CFP10ICA-CDR ISBN: 978-1-4244-4296-6 ISSN: 1520-6149
A STUDY OF SEVERAL MODEL SELECTION CRITERIA FOR DETERMINING THE NUMBER OF SIGNALS
Shikui Tu and Lei Xu
∗Department of Computer Science and Engineering,
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. {sktu,lxu}@cse.cuhk.edu.hk
ABSTRACT
Addressing the problem of detecting the number of source signals as selecting the hidden dimensionality of a Factor Analysis (FA) model, we investigate several model selection criteria via a new empirical analysis tool that examines the joint effect of signal-to-noise ratio (SNR) and sample size N on model selection performance. The contours of the model selection accuracies reveal a three-region partition of the space of SNR and N, and a diminishing marginal effect that trades off SNR against N at a fixed level of performance. Moreover, the newly derived Variational Bayes algorithm and three variants of the Bayesian Ying-Yang (BYY) algorithm are more robust to reductions in SNR and N, and the BYY variant with the priors’ hyperparameters updated is the best in general.
Index Terms— Number of signals, hidden dimensionality, linear model, model selection, criteria
1. INTRODUCTION
Detecting the number of underlying source signals is essential in many signal processing problems, such as sensor array processing, pole retrieval from a system response, direction-of-arrival estimation with a smart antenna system, and retrieval of overlapping echoes from radar backscatter (see, e.g., [1]). The observed vector can be modeled as a superposition of a finite number of underlying source signals with additive noise. The source signal and noise vector sequences are assumed to be two independent ergodic zero-mean Gaussian random processes. Moreover, this issue can also be addressed as a model selection problem, namely selecting the hidden dimensionality of Factor Analysis (FA) [2] in its special case of Principal Component Analysis (PCA) [3] under the Maximum Likelihood principle.
A classical approach to model selection is the two-stage procedure, i.e., parameter learning is performed on a set of candidate models, among which one is selected by a model selection criterion. The existing criteria include Akaike’s Information Criterion (AIC) [4], Bozdogan’s Consistent Akaike’s Information Criterion (CAIC) [5], the Hannan-Quinn information criterion (HQC) [6], Schwarz’s Bayesian Information Criterion (BIC) [7] (which coincides with Rissanen’s Minimum Description Length (MDL) [8]), and the more recently developed Minka’s criterion (MK) [9], Variational Bayes (VB) [10], and the Bayesian Ying-Yang (BYY) harmony learning criterion [11].
∗ Corresponding author: Lei Xu. Email: [email protected]. The work described in this paper was fully supported by a grant from the Research Grant Council of the Hong Kong SAR (Project No: CUHK4177/07E).
As early as [1], AIC and MDL were introduced to determine the number of signals, followed by a large body of research such as [12] that emphasizes the asymptotic properties of the criteria under certain assumptions. Along this track, this paper aims at a systematic investigation of those criteria via a new empirical analysis tool that examines the joint effect of the signal-to-noise ratio (SNR) and sample size N, rather than the effect of either SNR or N with the other fixed as in previous work, e.g., [13].
The adopted parameterization of FA differs from the common one used in, e.g., [10]. The two forms are equivalent in Maximum Likelihood learning but differ in model selection, as pointed out in [14] under BYY. In fact, the adopted form results in a better model selection ability under BYY and VB, with details given in another working paper [15]. Here, we implement the EM algorithm [3] for the classical criteria such as AIC. For VB, we derive a new VBEM algorithm by imposing appropriate priors on the unknown parameters. In addition to the existing prior-free version (BYYo) [16], BYY is further implemented not only by adopting the same priors as the VBEM (BYYp) but also by updating the hyperparameters of the priors (BYYph) under the Hessian-based second-order information conservation principle [16].
By varying N and SNR over a wide range in the empirical analysis, we connect the points of equal model selection accuracy into contours, which define a family of model selection performance indifference curves (a term borrowed from economics) for each criterion. We are then able to reveal a diminishing marginal effect, i.e., the amount of SNR (or N) needed to trade for a unit of N (or SNR) increases if the performance is kept unchanged. We are also able to present a three-region partition of the space of N and SNR: all methods perform well when SNR and N are large and badly when SNR and N are too small, while within the region of moderate SNR and N the performances of these methods clearly differ. Moreover, VB and three variants of BYY outperform the others in the region of diversity, while BYYph is the best in general.
The rest of the paper is organized as follows. Section 2 formulates the problem of determining the number of signals as estimating the hidden dimensionality of FA. Section 3 introduces the two-stage procedure with several model selection criteria, whose behaviors are empirically analyzed in Section 4, followed by the concluding remarks in Section 5.
2. PROBLEM FORMULATION
In signal processing [1], a common model for the received complex-valued signal vector $x(t)$ from an array of $n$ sensors at time instance $t$ is $x(t) = As(t) + e(t)$, where $A$ is the steering matrix with full column rank. The $m$-dimensional source signal vector sequence $\{s(t)\}$ is assumed to be a stationary and ergodic Gaussian random process with zero mean and positive definite covariance matrix $\Sigma_s$. The noise sequence $\{e(t)\}$ is assumed to be a stationary and ergodic Gaussian vector process, independent of the source signals, with zero mean and isotropic covariance matrix $\sigma_e^2 I_n$, where $I_n$ is the $n \times n$ unit matrix. Determining the number of source signals based on an observed sequence $\{x(t)\}_{t=1}^{N}$ is to estimate the rank of $A\Sigma_s A^H$ in $\Sigma_{x|c} = A\Sigma_s A^H + \sigma_e^2 I_n$, where $\Sigma_{x|c}$ is the population covariance matrix of the received data, and the superscript “H” means the complex conjugate transpose.
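The rank statement above can be checked numerically. The following is a minimal sketch, assuming illustrative sizes ($n = 8$, $m = 3$, $\sigma_e^2 = 0.5$) and a randomly drawn steering matrix and source covariance; none of these values come from the paper. It builds $\Sigma_{x|c}$ and shows that its $n - m$ smallest eigenvalues sit at the noise floor $\sigma_e^2$, so the number of eigenvalues above the floor equals the rank of $A\Sigma_s A^H$.

```python
import numpy as np

def population_covariance(A, Sigma_s, sigma2_e):
    """Return Sigma_{x|c} = A Sigma_s A^H + sigma_e^2 I_n."""
    n = A.shape[0]
    return A @ Sigma_s @ A.conj().T + sigma2_e * np.eye(n)

# Illustrative sizes and values (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n, m, sigma2_e = 8, 3, 0.5

# Random complex steering matrix; full column rank with probability one.
A = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
# Hermitian positive definite source covariance.
B = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
Sigma_s = B @ B.conj().T + np.eye(m)

Sigma_x = population_covariance(A, Sigma_s, sigma2_e)
eigvals = np.linalg.eigvalsh(Sigma_x)[::-1]  # real, sorted in descending order

# The n - m smallest eigenvalues equal sigma_e^2, so counting the eigenvalues
# above that floor recovers the rank of A Sigma_s A^H.
num_signals = int(np.sum(eigvals > sigma2_e + 1e-8))
print(np.round(eigvals, 3))
print("eigenvalues above the noise floor:", num_signals)
```

With a finite sample, the noise eigenvalues of the sample covariance spread out instead of coinciding, which is why a model selection criterion is needed rather than a simple threshold.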
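On the other hand, a model called Factor Analysis (FA) in machine learning [16, 3] and statistics [2] assumes an observed real-valued $n$-dimensional random variable $x$ as follows:
$$
\begin{cases}
x = Uy + \mu + e, \quad \Theta_m = \{U, \mu, \Lambda, \Sigma_e\}; \\
q(x|y) = G(x|Uy + \mu, \Sigma_e), \quad q(y) = G(y|0, \Lambda), \\
q(x|\Theta_m) = \int q(x|y)\, q(y)\, dy = G(x|\mu, \Sigma_x), \\
\Lambda = \mathrm{diag}[\lambda_1, \ldots, \lambda_m], \quad \Sigma_x = U\Lambda U^T + \Sigma_e,
\end{cases}
\qquad (1)
$$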
where $U^T U = I_m$, $\Sigma_x$ is the population covariance matrix of the data, and $\Sigma_e = \sigma_e^2 I_n$, which makes FA equivalent to PCA under the maximum likelihood principle [3]. Estimating the hidden dimensionality $m$ is to determine the rank of $U\Lambda U^T$ in $\Sigma_x = U\Lambda U^T + \sigma_e^2 I_n$ based on $\{x_t\}_{t=1}^{N}$.
The two rank estimation problems are equivalent in the sense that they aim to estimate the same rank $m$ in two similar covariance equations. Next, we focus on the latter one, which is also widely used as a dimensionality reduction technique for feature extraction.
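As an illustration of the FA model in eq.(1), the sketch below draws samples with an orthonormal $U$ and isotropic noise, then compares the sample covariance with $\Sigma_x = U\Lambda U^T + \sigma_e^2 I_n$. The helper name `sample_fa`, the random seed, the noise variance 0.25 and the sample size 20000 are assumptions made for this example only.

```python
import numpy as np

def sample_fa(N, U, lam, mu, sigma2_e, rng):
    """Draw N samples of x = U y + mu + e with y ~ G(0, diag(lam)), e ~ G(0, sigma2_e I)."""
    n, m = U.shape
    y = rng.standard_normal((N, m)) * np.sqrt(lam)          # factors with variances lam
    e = rng.standard_normal((N, n)) * np.sqrt(sigma2_e)     # isotropic noise
    return y @ U.T + mu + e

rng = np.random.default_rng(1)
n, m = 15, 5
U, _ = np.linalg.qr(rng.standard_normal((n, m)))  # columns satisfy U^T U = I_m
lam = np.ones(m)
mu = np.zeros(n)
sigma2_e = 0.25                                   # illustrative noise variance

Sigma_x = U @ np.diag(lam) @ U.T + sigma2_e * np.eye(n)
X = sample_fa(20000, U, lam, mu, sigma2_e, rng)
S = np.cov(X, rowvar=False)

pop_eigs = np.linalg.eigvalsh(Sigma_x)[::-1]
max_abs_err = np.max(np.abs(S - Sigma_x))
print(np.round(pop_eigs, 3))   # m eigenvalues at lam + sigma2_e, the rest at sigma2_e
print(max_abs_err)             # shrinks as the sample size grows
```

The population eigenvalues split cleanly into $m$ signal values and $n - m$ noise values; the sample covariance only approximates this split, and the approximation degrades as $N$ or the SNR decreases.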
3. MODEL SELECTION CRITERIA
The two-stage procedure performs parameter learning over a set of candidate models, among which one is selected by a model selection criterion. Typical examples include the classical AIC [4], BIC/MDL [7, 8], CAIC [5], and HQC [6], the more recent Minka’s criterion (MK) [9], VB [10], and BYY [16, 17], as well as the difference of negative log-likelihood (DNLL). They are briefly summarized in Tab.1.
criteria   Stage-I    Stage-II: $\hat{m} = \arg\min_m J(m)$
DNLL       EM alg.    $J(m) = -L(\hat{\Theta}_m^{ML}) + L(\hat{\Theta}_{m-1}^{ML})$
AIC        EM alg.    $J(m) = -L(\hat{\Theta}_m^{ML}) + d_m$
BIC        EM alg.    $J(m) = -L(\hat{\Theta}_m^{ML}) + \frac{d_m}{2} \ln N$
CAIC       EM alg.    $J(m) = -L(\hat{\Theta}_m^{ML}) + \frac{d_m}{2} (\ln N + 1)$
HQC        EM alg.    $J(m) = -L(\hat{\Theta}_m^{ML}) + d_m \ln(\ln N)$
MK         eig        $J(m)$ by equation (30) in [9]
VB         VBEM       $J(m) = -F(\hat{p}_U, \hat{p}_\nu, \hat{p}_\varphi, \hat{p}_Y, m)$
BYY        eq.(2)     $J(m) = -H(p\|q, \Theta_m^*, \Xi^*) + \frac{1}{2} d_m$

Table 1. The two-stage procedures for several criteria, where $L(\hat{\Theta}_m^{ML}) = \max_{\Theta_m} \ln q(X_N|\Theta_m)$ is the maximized log-likelihood of the data set $X_N$, and $d_m = nm + 1 - \frac{m(m-1)}{2}$ is the number of free parameters in FA. In Stage-I, “EM alg.” denotes the Expectation Maximization (EM) algorithm for FA; “eig” means estimating the sample eigenvalues for MK; $F(\hat{p}_U, \hat{p}_\nu, \hat{p}_\varphi, \hat{p}_Y, m)$ is the variational lower bound resulting from VBEM; and $H(p\|q, \Theta_m^*, \Xi^*)$ is the harmony functional resulting from implementing eq.(2).
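The likelihood-based rows of Table 1 can be sketched in a few lines. Note one substitution: instead of running the EM algorithm in Stage-I, this sketch uses the closed-form maximum likelihood solution for the isotropic-noise model from [3], computed from the eigenvalues of the sample covariance, which should reach the same maximized log-likelihood. The function names, the random seed and the synthetic data settings are assumptions for illustration.

```python
import numpy as np

def max_loglik(eigs, N, m):
    """Maximised log-likelihood with m factors, from the descending
    eigenvalues of the (biased) sample covariance, following [3]."""
    n = eigs.size
    sigma2 = eigs[m:].mean()  # ML noise variance: mean of the discarded eigenvalues
    return -0.5 * N * (n * np.log(2.0 * np.pi) + np.log(eigs[:m]).sum()
                       + (n - m) * np.log(sigma2) + n)

def num_free_params(n, m):
    """d_m = nm + 1 - m(m-1)/2, as in the caption of Table 1."""
    return n * m + 1 - m * (m - 1) / 2

def select_m(X, candidates, criterion):
    """Stage-II: return the candidate m that minimises J(m) for the given criterion."""
    N, n = X.shape
    eigs = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))[::-1]
    scores = {}
    for m in candidates:
        L, d = max_loglik(eigs, N, m), num_free_params(n, m)
        if criterion == "DNLL":
            J = -L + max_loglik(eigs, N, m - 1)
        elif criterion == "AIC":
            J = -L + d
        elif criterion == "BIC":
            J = -L + 0.5 * d * np.log(N)
        elif criterion == "CAIC":
            J = -L + 0.5 * d * (np.log(N) + 1.0)
        elif criterion == "HQC":
            J = -L + d * np.log(np.log(N))
        else:
            raise ValueError(f"unknown criterion: {criterion}")
        scores[m] = J
    return min(scores, key=scores.get)

# Synthetic data in an easy setting (large N, small noise); values are illustrative.
rng = np.random.default_rng(0)
n, m_star, N, sigma2_e = 15, 5, 800, 1.0 / 15.0
U, _ = np.linalg.qr(rng.standard_normal((n, m_star)))
X = rng.standard_normal((N, m_star)) @ U.T + np.sqrt(sigma2_e) * rng.standard_normal((N, n))
for name in ("DNLL", "AIC", "BIC", "CAIC", "HQC"):
    print(name, select_m(X, range(1, 10), name))
```

The criteria differ only in the penalty added to the negative log-likelihood, so all of them share the single eigendecomposition computed in `select_m`. MK, VB and BYY are not covered here because they require the quantities described in the rest of this section.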
One difficulty in Bayesian model selection is computing the marginal likelihood, which incorporates priors on the parameters and involves a high-dimensional integration. To approximate the marginal likelihood, Minka [9] proposed a criterion (MK) via the Laplace approximation. Variational Bayes (VB) [10] is another way to approximate the (log) marginal likelihood, using a lower bound obtained by variational methods. Since no VB algorithm exists for FA by eq.(1), we derive one in this paper by adopting for $U$ the uniform prior over the Stiefel manifold used in [9], i.e., $q(U) = 2^{-m} \prod_{i=1}^{m} \Gamma\left(\frac{n-i+1}{2}\right) \pi^{-\frac{n-i+1}{2}}$, and the commonly used Gamma densities as priors for the precision parameters, i.e., $q(\nu|a_\nu, b_\nu) = \prod_{i=1}^{m} \Gamma(\nu_i|a_{\nu_i}, b_{\nu_i})$ and $q(\varphi|a_\varphi, b_\varphi) = \Gamma(\varphi|a_\varphi, b_\varphi)$, with $\nu = \Lambda^{-1}$ and $\varphi = (\sigma_e^2)^{-1}$. Then, the VBEM algorithm implements $\max_{\{p_U, p_\nu, p_\varphi, p_Y\}} F$ for each candidate scale $m$, where $F$ is the variational lower bound
$$
F = \int p_U\, p_\nu\, p_\varphi\, p_Y \ln \frac{q(X_N, Y|\Theta)\, q(\Theta|\Xi)}{p_U\, p_\nu\, p_\varphi\, p_Y}\, dY\, dU\, d\nu\, d\varphi,
$$
where the posterior is constrained to the factorized form $p_U p_\nu p_\varphi p_Y$, $q(X_N, Y|\Theta) = \prod_t q(x_t|y_t)\, q(y_t)$ is given by eq.(1), and $q(\Theta|\Xi) = q(U)\, q(\nu|a_\nu, b_\nu)\, q(\varphi|a_\varphi, b_\varphi)$.
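The uniform prior density $q(U)$ above is a constant that depends only on $n$ and $m$, and it is convenient to evaluate it in the log domain. The sketch below does this with the standard library; the function name is hypothetical. Two small cases give a sanity check: for $n = 2$, $m = 1$ the manifold is the unit circle, so the density should be the reciprocal of its circumference, and for $n = 3$, $m = 1$ it is the unit sphere.

```python
import math

def log_stiefel_uniform_density(n, m):
    """log q(U) for q(U) = 2^{-m} prod_{i=1}^{m} Gamma((n-i+1)/2) pi^{-(n-i+1)/2}."""
    total = -m * math.log(2.0)
    for i in range(1, m + 1):
        half = (n - i + 1) / 2.0
        total += math.lgamma(half) - half * math.log(math.pi)
    return total

# Sanity checks against known surface measures.
circle = math.exp(log_stiefel_uniform_density(2, 1))   # unit circle, length 2*pi
sphere = math.exp(log_stiefel_uniform_density(3, 1))   # unit sphere, area 4*pi
print(circle, 1.0 / (2.0 * math.pi))
print(sphere, 1.0 / (4.0 * math.pi))

# The setting used later in the experiments (n = 15, m = 5).
print(log_stiefel_uniform_density(15, 5))
```

Because this term does not depend on the data, it acts within $\ln q(\Theta|\Xi)$ as an offset that changes with the candidate $m$.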
First proposed in [11] and systematically developed over a decade [16, 17], Bayesian Ying-Yang (BYY) harmony learning theory is a general statistical learning framework that handles both parameter learning and model selection under a best harmony principle. The general two-stage procedure of BYY harmony learning given in [16, 17] is summarized in Tab.1, and Stage-I is to implement
$$
\begin{aligned}
\text{I(a)}:\; & \Theta_m^{(\tau)} = \arg\max \Delta_{\Theta_m} H(p\|q, \Theta_m, \Xi^{(\tau-1)}), \\
\text{I(b)}:\; & \Xi^{(\tau)} = \arg\max \Delta_{\Xi} \{ H(p\|q, \Theta_m^{(\tau)}, \Xi) + \tfrac{1}{2} d(\Xi) \}, \\
& d(\Xi) = -d_m + (\Theta_m^{(\tau-1)} - \Theta_m^{(\tau)})^T\, \Omega\, (\Theta_m^{(\tau-1)} - \Theta_m^{(\tau)}),
\end{aligned}
\qquad (2)
$$
where $\Omega = \nabla^2_{\Theta\Theta^T} H(p\|q, \Theta_m^{(\tau)}, \Xi)$. Specifically, the harmony functional for FA by eq.(1) is
$$
H(p\|q, \Theta_m, \Xi) = \sum_t \ln G(x_t|0, \Sigma_x) + N \ln (2\pi e)^n |\Sigma_{y|x}| + d_r(W) + \ln q(\Theta|\Xi), \qquad (3)
$$
where $d_r(W) = -\mathrm{Tr}[\Delta_W^T (\Sigma_{y|x})^{-1} \Delta_W S_N]$ vanishes as $\Delta_W$, the difference between the free parameter $W$ and $\Lambda U^T \Sigma_x^{-1}$, converges to zero, i.e., as $W$ converges to $\Lambda U^T \Sigma_x^{-1}$, with $\Sigma_{y|x} = \Lambda^{-1} + U^T \Sigma_e^{-1} U$ and $\Sigma_x$ given in eq.(1).

band (%)              very good                                                        not good
80 ∼ 100              most criteria                                                    DNLL, AIC
40 ∼ 80               BYY, VB                                                          BIC, CAIC, DNLL
0 ∼ 40                BYY, VB                                                          the rest criteria
low SNR {1.5, 2.0}    VB {7 red ∗}; BYY {8 red ∗ = 1 (BYYo) + 3 (BYYp) + 4 (BYYph)}    the rest criteria get fewer than 3 red ∗
small N <= 75         BYYph gets most red ∗                                            the rest criteria get few red ∗

Table 2. The comparisons are based on (1) the band area between the specified contour lines (%) (the bigger and closer to the left corner, the better), and (2) the number of red asterisks (∗) (the more, the better).
Ignoring priors by letting $q(\Theta|\Xi) = 1$, BYY (denoted as “BYYo”) still possesses a good model selection ability [16]. By eq.(2), we further implement BYY (denoted as “BYYp”) by adopting the same priors as used in VB, so that $\ln q(\Theta|\Xi)$ in eq.(3) plays the role of regularization. By I(b) in eq.(2), the hyperparameters are updated in BYY (named “BYYph”) to further increase the harmony functional. All BYY algorithms are implemented by the gradient method.
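The limit of the free parameter $W$ in eq.(3) can be related to the matrix $\Lambda^{-1} + U^T \Sigma_e^{-1} U$ through a standard matrix identity: $\Lambda U^T \Sigma_x^{-1} = (\Lambda^{-1} + U^T \Sigma_e^{-1} U)^{-1} U^T \Sigma_e^{-1}$. The sketch below is only a numerical check of that identity on illustrative values (sizes, seed, $\Lambda$ and $\sigma_e^2$ are assumptions); it is not an implementation of the BYY gradient algorithms.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma2_e = 6, 2, 0.3                        # illustrative sizes and noise level
U, _ = np.linalg.qr(rng.standard_normal((n, m)))  # U^T U = I_m
Lam = np.diag([2.0, 0.7])
Sigma_e = sigma2_e * np.eye(n)
Sigma_x = U @ Lam @ U.T + Sigma_e                 # as in eq.(1)

# Limit of the free parameter W named in the text.
W_target = Lam @ U.T @ np.linalg.inv(Sigma_x)

# Same matrix computed through Lambda^{-1} + U^T Sigma_e^{-1} U.
M = np.linalg.inv(Lam) + U.T @ np.linalg.inv(Sigma_e) @ U
W_alt = np.linalg.solve(M, U.T @ np.linalg.inv(Sigma_e))

print(np.max(np.abs(W_target - W_alt)))           # close to machine precision
```

The second form only inverts an $m \times m$ matrix, which may be preferable when $m$ is much smaller than $n$.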
4. EMPIRICAL ANALYSIS
Empirical analysis is based on a series of controlled experiments that vary the sample size $N \in \{25, 50, 75, 100, 200, 400, 800\}$ and the SNR $\gamma_o = \frac{\lambda_{m^*}}{\sigma_e^2} + 1 \in \{1.2, 1.5, 2, 2.5, 3, 3.5, 4, 8, 16\}$, where $\lambda_1 = \ldots = \lambda_{m^*} = 1$, and $n$, $m^*$ (i.e., dim(x), dim(y)) are fixed at 15 and 5, respectively, due to the space limit. For each of 102 independent runs, the two-stage procedure for every criterion is made on the set of candidate models $M = \{1, \ldots, 9\}$ based on a data set $X_N$ randomly generated according to a chosen setting of $N$ and $\gamma_o$. We report the percentages of successful selections, i.e., $\hat{m} = m^*$, in the form of contour maps in Fig.1.
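The experimental protocol above can be sketched for a single criterion. The code below inverts the SNR definition to obtain $\sigma_e^2 = \lambda_{m^*}/(\gamma_o - 1)$, generates data from the FA model with $\lambda_i = 1$, and estimates the successful-selection rate of BIC. Two assumptions are made: the closed-form maximum likelihood solution of [3] replaces the EM algorithm, and the number of runs is left as a free argument (20 in the usage lines, chosen only to keep the example fast).

```python
import numpy as np

def noise_variance_from_snr(gamma_o, lam_min=1.0):
    """Invert gamma_o = lam_{m*} / sigma_e^2 + 1 for sigma_e^2."""
    return lam_min / (gamma_o - 1.0)

def generate_fa_data(N, n, m_star, gamma_o, rng):
    """Sample X_N from the FA model with lambda_i = 1 and a random orthonormal U."""
    U, _ = np.linalg.qr(rng.standard_normal((n, m_star)))
    sigma2_e = noise_variance_from_snr(gamma_o)
    y = rng.standard_normal((N, m_star))
    e = np.sqrt(sigma2_e) * rng.standard_normal((N, n))
    return y @ U.T + e

def bic_select(X, candidates):
    """BIC row of Table 1, using the closed-form ML likelihood (assumption)."""
    N, n = X.shape
    eigs = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))[::-1]
    best_m, best_J = None, np.inf
    for m in candidates:
        sigma2 = eigs[m:].mean()
        L = -0.5 * N * (n * np.log(2.0 * np.pi) + np.log(eigs[:m]).sum()
                        + (n - m) * np.log(sigma2) + n)
        d_m = n * m + 1 - m * (m - 1) / 2
        J = -L + 0.5 * d_m * np.log(N)
        if J < best_J:
            best_m, best_J = m, J
    return best_m

def success_rate(N, gamma_o, runs, rng, n=15, m_star=5, candidates=range(1, 10)):
    """Percentage of runs in which the selected m equals m*."""
    hits = sum(bic_select(generate_fa_data(N, n, m_star, gamma_o, rng), candidates) == m_star
               for _ in range(runs))
    return 100.0 * hits / runs

rng = np.random.default_rng(0)
for N, gamma_o in [(800, 16.0), (100, 3.0), (25, 1.2)]:
    print(N, gamma_o, success_rate(N, gamma_o, runs=20, rng=rng))
```

Evaluating `success_rate` over the full grid of $N$ and $\gamma_o$ values yields one matrix of percentages per criterion, which is the input to a contour plot such as those in Fig.1.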
The contour maps define a family of model selection indifference curves that visualize the performance over the space of SNR and N. The performance decreases as N and SNR decrease. Two further observations can be made from Fig.1: (1) a three-region partition, i.e., all criteria perform well when SNR and N are large and badly when SNR and N are too small, while the region with moderate SNR and N differentiates the criteria well; (2) a diminishing marginal effect, i.e., the amount of SNR (or N) needed to trade for a unit loss of N (or SNR) increases when moving down an indifference curve.
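One way to read an indifference curve off a grid of success rates is to find, for each sample size, the smallest SNR on the grid at which the rate reaches a chosen level. The sketch below shows this on a small made-up accuracy grid; the numbers are invented purely to exercise the function and are not results from the paper, and the function name is hypothetical.

```python
import numpy as np

def indifference_curve(acc, n_values, snr_values, level):
    """For each sample size, return the smallest SNR on the grid whose
    accuracy reaches `level` (None if the level is never reached)."""
    curve = []
    for i, N in enumerate(n_values):
        reached = np.flatnonzero(acc[i] >= level)
        snr = snr_values[reached[0]] if reached.size else None
        curve.append((N, snr))
    return curve

# Made-up accuracy grid (%), for illustration only; NOT results from the paper.
# Rows follow n_values, columns follow snr_values.
n_values = [25, 100, 800]
snr_values = [1.5, 2.0, 4.0, 16.0]
toy_acc = np.array([[0, 10, 40, 85],
                    [5, 50, 90, 100],
                    [60, 95, 100, 100]])

curve_80 = indifference_curve(toy_acc, n_values, snr_values, level=80)
print(curve_80)
```

On a real grid, the SNR needed to hold the level would be expected to rise faster and faster as N shrinks, which is the diminishing marginal effect described above.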
Detailed observations from Fig.1 and Fig.2 are listed in Tab.2. VB and three variants of BYY are relatively more robust than the other criteria against reductions in the sample size and SNR, and BYYph performs the best in general.
All methods are also evaluated on the real-world dataset Pendigits¹ (16 attributes, 10 classes, 10992 instances). Similarly, we vary the training sample size N. The classification results (Tab.3) basically coincide with the model selection performance on synthetic data.
%        N = 16          N = 30          N = 100
AIC      56.46 ± 6.20    88.91 ± 1.72    96.61 ± 0.39
BIC      56.46 ± 6.19    87.72 ± 1.92    96.64 ± 0.40
HQC      56.46 ± 6.20    88.51 ± 1.59    96.62 ± 0.31
CAIC     48.34 ± 12.7    87.63 ± 1.62    96.63 ± 0.33
DNLL     79.18 ± 9.01    86.48 ± 1.71    91.26 ± 0.58
VB       87.02 ± 2.45    93.86 ± 1.31    96.19 ± 0.31
BYYph    88.57 ± 1.04    94.15 ± 0.33    96.29 ± 0.16

Table 3. Classification accuracies (%), mean ± stdev of 102 runs.
5. CONCLUSION
Based on the problem of determining the number of underlying source signals, we have investigated the relative strengths and weaknesses of not only the classical AIC, BIC/MDL, CAIC, and HQC, but also the more recently developed Minka’s criterion, VB, and BYY. We derive a new VB algorithm for FA by imposing appropriate priors, which are also adopted in BYY for further implementations. The investigation is made via a new empirical analysis tool featuring model selection indifference curves, which reveal a three-region partition and a diminishing marginal effect. Moreover, BYY with the priors’ hyperparameters updated is the best in general.
6. REFERENCES
[1] M. Wax and T. Kailath, “Detection of signals by information theoretic criteria,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-33, no. 2, pp. 387, 1985.
[2] T.W. Anderson and H. Rubin, “Statistical inference in factor analysis,” in Proc. of third Berkeley symposium on mathematical statistics and probability, 1956, vol. 5, pp. 111–150.
[3] Michael E. Tipping and Christopher M. Bishop, “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
[4] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, Dec 1974.
[5] Hamparsum Bozdogan, “Model selection and Akaike’s Information Criterion (AIC): The general theory and its analytical extensions,” Psychometrika, vol. 52, no. 3, pp. 345–370, 1987.
[6] E. J. Hannan, A. J. McDougall, and D. S. Poskitt, “The determination of the order of an autoregression,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 51, pp. 217–233, 1989.
[7] Gideon Schwarz, “Estimating the Dimension of a Model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
¹ From the UCI repository: http://archive.ics.uci.edu/ml/datasets.html
[Figure 1: nine contour-map panels of successful-selection rates, one each for AIC, BIC, CAIC, HQC, MK, VB, BYYo, BYYp, and BYYph. Axes: SNR (adjusted), with ticks at 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 8, 16, and Sample Size N (adjusted), with ticks at 25, 50, 75, 100, 200, 400, 800. Contour labels range from 20 to 100 (%).]
Fig. 1. Contour maps of the successful selection rates of all criteria, drawn against adjusted axes (i.e., equal spacing between setting values). A red asterisk (∗) indicates that the corresponding criterion achieves the highest successful selection rate at that setting.
[8] J. Rissanen, “Modelling by the shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.
[9] Thomas P. Minka, “Automatic choice of dimensionality for PCA,” in Advances in Neural Information Processing Systems 13, 2001, pp. 598–604.
[10] Matthew J. Beal, Variational Algorithms for Approximate Bayesian Inference, Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[11] Lei Xu, “Bayesian-Kullback coupled Ying-Yang machines: Unified learnings and new results on vector quantization,” in International Conference on Neural Information Processing (ICONIP), 1995, pp. 977–988.
[12] E. Fishler, M. Grosmann, and H. Messer, “Detection of signals by information theoretic criteria: general asymptotic performance analysis,” IEEE Transactions on Signal Processing, vol. 50, no. 5, pp. 1027–1036, May 2002.
[13] Shikui Tu and Lei Xu, “Theoretical analysis and comparison of several criteria on linear model dimension reduction,” in ICA ’09: Proceedings of the 8th International Conference on Independent Component Analysis and Signal Separation, Berlin, Heidelberg, 2009, pp. 154–162, Springer-Verlag.
[14] Lei Xu, “Bayesian Ying-Yang learning theory for data dimension reduction and determination,” Journal of Computational Intelligence in Finance, vol. 6, no. 5, pp. 6–18, 1998.
[15] Shikui Tu and Lei Xu, “On the two parameterizations of factor analysis: which one is better?,” (In preparation), 2009.
[16] Lei Xu, “Bayesian Ying Yang Learning,” in Scholarpedia 2(3):1809, http://scholarpedia.org/article/Bayesian Ying Yang Learning, 2007.
[17] Lei Xu, “Bayesian Ying-Yang System, Best Harmony Learning, and Five Action Circling,” to appear in an invited special issue on Emerging Themes on Information Theory and Bayesian Approach, Frontiers of Electrical and Electronic Engineering in China, a journal jointly published by Higher Education Press of China and Springer, 2010.
[Figure 2: contour-map panel of successful-selection rates by DNLL. Axes: SNR (adjusted), with ticks at 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 8, 16, and Sample Size N (adjusted), with ticks at 25, 50, 75, 100, 200, 400, 800. Contour labels range from 20 to 100 (%).]
Fig. 2. Continuation of Fig.1 for DNLL.