Institute of Statistics, National University of Kaohsiung
Master's Thesis

An Efficient Stochastic Matching Pursuit Algorithm and its Applications
一個有效率的隨機匹配追蹤演算法及其應用

Student: Yu-Chieh Tsai (蔡雨潔)
Advisor: Prof. Ray-Bing Chen (陳瑞彬)

July 2010

Acknowledgements

With the end of this semester, my graduate studies also come to a close. Many things happened during these two years. First, I thank my advisor, Prof. Ray-Bing Chen (陳瑞彬), for his patient guidance, which allowed me, who understood so little in my first year, to complete this thesis step by step and to learn a great deal along the way. I also thank every professor in the institute for their dedicated teaching, from which I gained much in my coursework. In addition, I thank my seniors 基祥 and 怡如, who always offered advice and helped me solve problems whenever I ran into difficulties.

In daily life, I thank 蘭屏, who has assisted us with all kinds of matters ever since I arrived at the university and who offered timely advice to correct our mistakes. I also thank my fellow graduate students, every one of them a treasure, for enriching my graduate life, bringing me many surprises, and keeping me company at meals and at exercise. I am happy to have had you all!

Finally, I thank my family for always supporting every decision I make and letting me do what I want to do, and my friends who have accompanied me along the way and listened to the big and small events of my life. Because of you I live more happily. I love you!

Yu-Chieh Tsai
Institute of Statistics, National University of Kaohsiung
July 2010

An Efficient Stochastic Matching Pursuit Algorithm and its Applications

by
Yu-Chieh Tsai

Advisor
Ray-Bing Chen

Institute of Statistics
National University of Kaohsiung
Kaohsiung, Taiwan 811, R.O.C.

July 2010

Contents

Chinese Abstract
English Abstract
1 Introduction
2 Efficient Stochastic Matching Pursuit algorithm
  2.1 Statistical Model
  2.2 Stochastic matching pursuit algorithm
    2.2.1 Prior assumptions
    2.2.2 Componentwise Gibbs sampler
    2.2.3 Stochastic matching pursuit
  2.3 Full Bayesian approach
  2.4 Efficient stochastic matching pursuit
    2.4.1 Computational cost
  2.5 Selection criterion
3 Denoising with the Unions of Orthogonal Bases
  3.1 The denoising problem
  3.2 Bayesian algorithm
  3.3 Simulations
  3.4 A real example
4 Conclusion
References

摘要 (Chinese Abstract)

An Efficient Stochastic Matching Pursuit Algorithm and its Applications. Student: 蔡雨潔 (Yu-Chieh Tsai). Advisor: 陳瑞彬 (Ray-Bing Chen). Institute of Statistics, National University of Kaohsiung. The Chinese-language abstract page is garbled in the source; its content corresponds to the English abstract on the following page.

An Efficient Stochastic Matching Pursuit Algorithm and its Applications

Student: Yu-Chieh Tsai
Institute of Statistics
National University of Kaohsiung

Advisor: Dr. Ray-Bing Chen
Institute of Statistics
National University of Kaohsiung

ABSTRACT

The stochastic matching pursuit algorithm, proposed by Chen et al. (2009), performs well in Bayesian variable selection. In this thesis, the full Bayesian version of the stochastic matching pursuit algorithm is first proposed, and then an efficient computing approach for this algorithm is introduced. Next we focus on applying the stochastic matching pursuit algorithm to the denoising problem. Several simulations and real sound examples are used to demonstrate the performance of the algorithm.

Keywords: Gibbs sampler, Metropolis algorithm, denoising.

1 Introduction

Suppose Y is a dependent variable of interest, and X1, X2, ..., Xp are the independent variables. We are interested in modeling the relationship between Y and the independent variables X1, X2, ..., Xp. Since X1, X2, ..., Xp contain some redundant or irrelevant variables, especially when p is large, the variable selection problem is to select the important variables X1*, X2*, ..., Xk*, k < p, from X1, X2, ..., Xp to fit the variable Y. In this thesis, Bayesian approaches are used for variable selection. Chen et al. (2009) proposed an efficient Bayesian variable selection algorithm, named the stochastic matching pursuit algorithm. In Chen et al. (2009), the prior distribution of each coefficient is set to be a mixture of a Dirac delta function and a normal distribution. The Dirac delta function is a point mass at 0, and the normal distribution is centered at 0 and has a large variance. For such a mixture distribution, we can include indicators to indicate which variables are selected. Chen et al. (2009) have shown that the stochastic matching pursuit algorithm performs well in the variable selection problem for the cases of large n and small p.

The stochastic matching pursuit algorithm is a Metropolis scheme with birth and death processes. The birth process adds a new variable to the set of selected variables. The birth process is very similar to the matching pursuit algorithm in Mallat and Zhang (1993), because the variables that have larger correlations with the residual vector are assigned higher probabilities of being added. The death process randomly deletes a variable from the set of selected variables. However, in Chen et al. (2009), some parameters of the algorithm must be pre-specified. Hence the first goal of this thesis is to extend the stochastic matching pursuit algorithm to a full Bayesian approach, i.e. to sample all the parameters from their own posterior distributions. We then focus on how to efficiently implement this Markov chain Monte Carlo (MCMC) algorithm.

In the second part of this thesis, the denoising problem is considered. Févotte and Godsill (2006) studied the denoising problem via a Bayesian selection approach. In their work, the set of independent variables is chosen as the union of two orthonormal bases. Here we treat each orthonormal basis as an individual window, and then the window version of the stochastic matching pursuit algorithm can be applied to solve denoising problems. Several simulations and real sound examples are used to demonstrate the performance of the stochastic matching pursuit algorithm in this denoising problem.

This thesis is organized as follows. In Section 2, we first introduce the MCMC algorithms proposed in Chen et al. (2009). Then we extend these two algorithms to full Bayesian algorithms. We also show how to implement the stochastic matching pursuit algorithm more efficiently. In Section 3, the signal denoising problem is introduced. We then apply the window version of the stochastic matching pursuit algorithm to the denoising problem, and compare its performance with the algorithm proposed in Févotte and Godsill (2006) via several simulations and real sound examples.

2 Efficient Stochastic Matching Pursuit algorithm

In this section, the stochastic matching pursuit algorithm in Chen et al. (2009) is modified into a full Bayesian approach, and a more efficient version is proposed. We first introduce the statistical model considered here. Then we briefly describe the full Bayesian stochastic matching pursuit algorithm and show how to improve the efficiency of this algorithm. At the end of this section, we demonstrate the efficiency of this new version via several simulations.

2.1 Statistical Model

In this thesis, our model assumption is the simple linear model, i.e.

Y = Xβ + ε,   (1)

where the response Y is an n × 1 vector, X = [X1, X2, ..., Xp] is an n × p model matrix, and Xi is the i-th independent variable. Here β = (β1, β2, ..., βp)' is a p × 1 unknown coefficient vector, and ε = (ε1, ε2, ..., εn)' is an n × 1 multivariate normal vector with zero mean and covariance matrix σ²In, where In is the n × n identity matrix. In the variable selection problem, we want to choose the "best" subset of variables from X1, X2, ..., Xp to fit the response Y. To show which variables are included in the model, we introduce a latent variable γi ∈ {0, 1}, an indicator variable denoting whether Xi is selected. That is, when γi = 1, Xi is selected, and when γi = 0, Xi is not selected. Hence the p × 1 vector γ = (γ1, γ2, ..., γp) indicates which variables are selected in the model.

2.2 Stochastic matching pursuit algorithm

In this subsection, we introduce two Bayesian selection algorithms, the componentwise Gibbs sampler and the stochastic matching pursuit algorithm, proposed by Chen et al. (2009). Before introducing the algorithms, the prior assumptions are described first.

2.2.1 Prior assumptions

According to Chen et al. (2009), the prior assumptions on β, γ, and σ are chosen as follows. First, the pairs (γi, βi) are assumed independent for i = 1, ..., p. A Bernoulli prior is used for γi, with

P(γi = 1) = pi,   P(γi = 0) = 1 − P(γi = 1) = 1 − pi.   (2)

The prior of the coefficient βi depends on γi and is defined by the following mixture distribution,

βi | γi, τi² ~ (1 − γi) δ0 + γi N(0, τi²),   (3)

where δ0 is the Dirac delta function and τi² is the variance of the normal prior when γi = 1. Without loss of generality, let ρ = pi = P(γi = 1), i = 1, 2, ..., p, because we treat each variable as having the same probability of being included in the model; to simplify the notation, Chen et al. (2009) also set τ = τi for i = 1, ..., p. Thus, based on equation (3), we have

βi ~ (1 − γi) δ0 + γi N(0, τ²).   (4)

Finally, the variance σ² is given an inverse Gamma prior, σ² ~ IG(ν/2, νλ/2).

2.2.2 Componentwise Gibbs sampler

In the componentwise Gibbs sampler, γi is drawn by fixing the other (γj, βj), ∀ j ≠ i. If the basis is active, i.e. γi = 1, then the corresponding coefficient βi is sampled from its posterior distribution; otherwise βi is set to zero. That is, the componentwise Gibbs sampler samples the pair (γi, βi) at one time.

The key step of the componentwise Gibbs sampler is to compute the likelihood ratio

z_i = P(Y | γ_i = 1, β_k, ∀k ≠ i) / P(Y | γ_i = 0, β_k, ∀k ≠ i) = √(σ_{i*}² / τ²) exp{ r_i² / (2σ_{i*}²) },   i = 1, ..., p,   (5)

where σ_{i*}² = σ²τ² / (X_i'X_i τ² + σ²), r_i = R_i'X_i τ² / (σ² + X_i'X_i τ²), and R_i = Y − Σ_{k≠i} β_k X_k. Then the posterior probability of γ_i is

P(γ_j = 1 | {β_i, ∀i ≠ j}, Y) = ρ z_j / ((1 − ρ) + ρ z_j).

If γ_i = 1, then we sample β_i ~ N(r_i, σ_{i*}²); otherwise, β_i is set to 0. The componentwise Gibbs sampler algorithm in Chen et al. (2009) is shown in Algorithm 1. In fact, a similar algorithm has also been proposed in Geweke (1996), by setting the prior of β_i | γ_i = 1 as a truncated normal distribution.

Algorithm 1. The componentwise Gibbs sampler algorithm.

(I) Randomly select a regressor X_j. Compute R_j = Y − Σ_{i≠j} β_i X_i, and let σ_{j*}² = σ²τ² / (X_j'X_j τ² + σ²) and r_j = R_j'X_j τ² / (X_j'X_j τ² + σ²).

(II) Compute z_j = P(Y | γ_j = 1, {β_i, ∀i ≠ j}) / P(Y | γ_j = 0, {β_i, ∀i ≠ j}) = √(σ_{j*}²/τ²) exp{ r_j² / (2σ_{j*}²) }. Then evaluate the posterior probability P(γ_j = 1 | {β_i, ∀i ≠ j}, Y) = ρ z_j / ((1 − ρ) + ρ z_j).

(III) Sample γ_j from the above posterior probability. If γ_j = 0, then set β_j = 0; otherwise, sample β_j ~ N(r_j, σ_{j*}²). Go back to (I). After the N_k-th iteration, go to (IV).

(IV) Compute Res = Y − Σ_i β_i X_i, then sample σ² ~ IG((n + ν)/2, (Res'Res + νλ)/2). Go back to (I).

One thing to note is that the variables can be visited by a systematic scan or in random order in Algorithm 1, and the variance σ² is updated fewer times than the coefficients and indicators.
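
To make steps (I)-(IV) concrete, the following is a minimal numerical sketch of one random-scan sweep of the componentwise Gibbs sampler. It is an illustration only, not the MATLAB code of Chen et al. (2009); the function name gibbs_sweep, the array layout, and the prior values ν = λ = 1 used in step (IV) are our own assumptions.

```python
import numpy as np

def gibbs_sweep(X, y, beta, gamma, sigma2, tau2, rho, rng):
    """One random-scan sweep of the componentwise Gibbs sampler (steps I-III),
    followed by one draw of sigma^2 (step IV).  beta is a length-p float array
    and gamma a length-p 0/1 array; both are updated in place."""
    n, p = X.shape
    for j in rng.permutation(p):
        xj = X[:, j]
        Rj = y - X @ beta + beta[j] * xj               # R_j = y - sum_{i != j} beta_i X_i
        xtx = xj @ xj
        sj2 = sigma2 * tau2 / (xtx * tau2 + sigma2)    # sigma_{j*}^2
        rj = (Rj @ xj) * tau2 / (xtx * tau2 + sigma2)  # r_j
        zj = np.sqrt(sj2 / tau2) * np.exp(0.5 * rj**2 / sj2)   # likelihood ratio (5)
        prob = rho * zj / ((1.0 - rho) + rho * zj)     # P(gamma_j = 1 | ...)
        gamma[j] = int(rng.random() < prob)
        beta[j] = rng.normal(rj, np.sqrt(sj2)) if gamma[j] == 1 else 0.0
    # Step (IV): in the thesis sigma^2 is refreshed only every N_k iterations;
    # here it is drawn once per sweep for simplicity (assumed nu = lambda = 1).
    nu, lam = 1.0, 1.0
    res = y - X @ beta
    sigma2 = 1.0 / rng.gamma((n + nu) / 2.0, 2.0 / (res @ res + nu * lam))
    return sigma2
```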

2.2.3 Stochastic matching pursuit

There are two possible problems for the componentwise Gibbs sampler: if the variables are highly correlated or the residual variance is too small, an inferior variable might be visited first and be selected. To overcome these two problems, Chen et al. (2009) proposed the stochastic matching pursuit algorithm by combining the idea of matching pursuit (Mallat and Zhang, 1993). The stochastic matching pursuit algorithm is a Metropolis scheme. It consists of birth and death processes, i.e. it includes these two possible phases in each iteration. The birth process adds a variable to the active set of variables, and in the death process a variable is deleted from the active set of variables. The componentwise Gibbs sampler can be viewed as having only the birth process. With probability p_birth we propose to go to the birth process, and with probability p_death = 1 − p_birth we propose to go to the death process.

The details of the birth process are as follows. First we compute z_i = P(Y | γ_i = 1, β_k, ∀k ≠ i) / P(Y | γ_i = 0, β_k, ∀k ≠ i), as described in (I) and (II) of the componentwise Gibbs sampler algorithm, for all the inactive variables. The larger z_i is, the higher the probability that variable i is selected. In the Metropolis scheme, we sample a variable i from the group of inactive variables with probability proportional to z_i; the proposal probability of adding variable i to the set of active variables is z_i / Σ_{j:γ_j=0} z_j, and we also sample β_i by step (III) of Algorithm 1 at the same time. The death process, which deletes an active variable, selects a variable randomly among all the active variables; once the delete proposal is accepted, the corresponding γ_i and β_i are set to zero.

Since the Metropolis scheme consists of a pair of reversible moves, birth and death, we present how to calculate the acceptance probability of the birth process and of the death process. Let A be the number of active variables. The acceptance probability for adding variable i whose current γ_i = 0 is

p_accept-birth = min{ 1, [P(γ_i = 1 | {β_k, ∀k ≠ i}, Y) / P(γ_i = 0 | {β_k, ∀k ≠ i}, Y)] · (p_death / p_birth) · [ (1/(A + 1)) / (z_i / Σ_{j:γ_j=0} z_j) ] }
             = min{ 1, [ρ z_i / (1 − ρ)] · (p_death / p_birth) · [ (1/(A + 1)) / (z_i / Σ_{j:γ_j=0} z_j) ] }
             = min{ 1, [ρ / (1 − ρ)] · (p_death / p_birth) · Σ_{j:γ_j=0} z_j / (A + 1) },   (6)

and the acceptance probability for deleting variable i whose current γ_i = 1 is

p_accept-death = min{ 1, [P(γ_i = 0 | {β_k, ∀k ≠ i}, Y) / P(γ_i = 1 | {β_k, ∀k ≠ i}, Y)] · (p_birth / p_death) · [ (z_i / (Σ_{j:γ_j=0} z_j + z_i)) / (1/A) ] }
              = min{ 1, [(1 − ρ) / ρ] · (p_birth / p_death) · A / (Σ_{j:γ_j=0} z_j + z_i) }.   (7)

When p is large, it might cost more computational time to select and add a variable by comparing all the inactive variables, so the window version is a possible way to improve this algorithm. We choose the windows of variables before running the algorithm, and the different windows can overlap with each other. Then, within each iteration of the algorithm, we choose a window and implement the birth and death processes only for the variables within this window. The stochastic matching pursuit algorithm in Chen et al. (2009) with the window version is displayed in Algorithm 2.

Algorithm 2. The window version of the stochastic matching pursuit algorithm.

(I) Randomly select a window W; let A be the number of active bases within W, and let B be the number of inactive bases within W. With probability p_birth, go to (II). With probability p_death = 1 − p_birth, go to (IV).

(II) With probability p_accept-birth calculated according to (6), go to (III); with probability 1 − p_accept-birth, go back to (I).

(III) Among all the inactive variables j ∈ W with γ_j = 0, sample a base j with probability proportional to z_j, then let γ_j = 1 and sample β_j as described in (III) of Algorithm 1. Go back to (I).

(IV) If A > 0, then randomly select an active variable k with γ_k = 1 within W.

(V) With probability p_accept-death calculated according to (7), accept the proposal of deleting the base k, i.e., set β_k = 0, and go back to (I). With probability 1 − p_accept-death, reject the proposal of deleting the variable k, and sample β_k ~ N(r_k, σ_{k*}²). Go back to (I). After the N_k-th iteration, go to (VI).

(VI) Compute Res = Y − Σ_i β_i X_i, then sample σ² ~ IG((n + ν)/2, (Res'Res + νλ)/2). Go back to (I).
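
One Metropolis move of the window version can be sketched as follows. This is a hedged illustration of Algorithm 2, not the authors' implementation: the function smp_step, its arguments, and the in-place updates are our own conventions, while the acceptance probabilities follow (6) and (7).

```python
import numpy as np

def smp_step(X, y, beta, gamma, sigma2, tau2, rho, p_birth, window, rng):
    """One birth/death move of the stochastic matching pursuit algorithm,
    restricted to the index set `window` (a sketch)."""
    idx = np.asarray(window)
    active = idx[gamma[idx] == 1]
    inactive = idx[gamma[idx] == 0]
    A = len(active)

    def stats(j):
        """Return (z_j, r_j, sigma_{j*}^2) for variable j, as in Algorithm 1."""
        xj = X[:, j]
        Rj = y - X @ beta + beta[j] * xj
        xtx = xj @ xj
        sj2 = sigma2 * tau2 / (xtx * tau2 + sigma2)
        rj = (Rj @ xj) * tau2 / (xtx * tau2 + sigma2)
        return np.sqrt(sj2 / tau2) * np.exp(0.5 * rj**2 / sj2), rj, sj2

    if rng.random() < p_birth and len(inactive) > 0:
        # Birth: accept with probability (6), then add j with prob. proportional to z_j.
        zs = np.array([stats(j)[0] for j in inactive])
        acc = min(1.0, rho / (1.0 - rho) * (1.0 - p_birth) / p_birth * zs.sum() / (A + 1))
        if rng.random() < acc:
            j = rng.choice(inactive, p=zs / zs.sum())
            _, rj, sj2 = stats(j)
            gamma[j], beta[j] = 1, rng.normal(rj, np.sqrt(sj2))
    elif A > 0:
        # Death: pick an active variable at random, delete it with probability (7).
        k = rng.choice(active)
        zk, rk, sk2 = stats(k)
        z_sum = sum(stats(j)[0] for j in inactive) + zk
        acc = min(1.0, (1.0 - rho) / rho * p_birth / (1.0 - p_birth) * A / z_sum)
        if rng.random() < acc:
            gamma[k], beta[k] = 0, 0.0
        else:
            beta[k] = rng.normal(rk, np.sqrt(sk2))   # a rejected death refreshes beta_k
```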

2.3 Full Bayesian approach

In this thesis, the full Bayesian approach of the stochastic matching pursuit algorithm is studied. Since the two tuning parameters, ρ and τ², are fixed in Chen et al. (2009), to make the stochastic matching pursuit algorithm a full Bayesian algorithm we add steps to sample these two parameters by setting proper priors. These additional prior assumptions are chosen according to Févotte and Godsill (2006). First, the prior of ρ is set to be Beta(α_ρ, β_ρ). The prior of τ_i² is set as an inverse Gamma prior, i.e. τ_i² ~ IG(α, λ_τ). Here we also sample the scale parameter λ_τ, whose prior distribution is chosen as a Gamma prior, λ_τ ~ Gamma(α_λ, β_λ). Based on these prior assumptions, the posterior distribution of ρ is

ρ ~ Beta( Σ_{j=1}^p γ_j + α_ρ, p − Σ_{j=1}^p γ_j + β_ρ ).

For the parameter τ_i², its posterior distribution is

τ_i² | γ_i ~ (1 − γ_i) IG(α, λ_τ) + γ_i IG(1/2 + α, β_i²/2 + λ_τ).

We simply fix α = 1 in our code. The scale parameter λ_τ is drawn from the posterior distribution λ_τ ~ Gamma(pα + α_λ, Σ_{i=1}^p 1/τ_i² + β_λ). The full Bayesian versions of the componentwise Gibbs sampler and of the stochastic matching pursuit algorithm are then shown in Algorithms 3 and 4.

Algorithm 3. The full Bayesian componentwise Gibbs sampler algorithm.

(I) Randomly select a regressor X_j. Compute R_j = Y − Σ_{i≠j} β_i X_i, and let σ_{j*}² = σ²τ_j² / (X_j'X_j τ_j² + σ²) and r_j = R_j'X_j τ_j² / (X_j'X_j τ_j² + σ²).

(II) Compute z_j = P(Y | γ_j = 1, {β_i, ∀i ≠ j}) / P(Y | γ_j = 0, {β_i, ∀i ≠ j}) = √(σ_{j*}²/τ_j²) exp{ r_j² / (2σ_{j*}²) }. Then the posterior probability is P(γ_j = 1 | {β_i, ∀i ≠ j}, Y) = ρ z_j / ((1 − ρ) + ρ z_j).

(III) Sample γ_j from the above posterior probability. If γ_j = 0, then set β_j = 0 and sample τ_j² ~ IG(α, λ_τ); if γ_j = 1, then sample β_j ~ N(r_j, σ_{j*}²) and τ_j² ~ IG(1/2 + α, β_j²/2 + λ_τ). Go back to (I). After the N_k-th iteration, go to (IV).

(IV) Sample λ_τ ~ Gamma(pα + α_λ, Σ_j 1/τ_j² + β_λ), and ρ ~ Beta(Σ_j γ_j + α_ρ, p − Σ_j γ_j + β_ρ).

(V) Compute Res = Y − Σ_i β_i X_i, then sample σ² ~ IG((n + ν)/2, (Res'Res + νλ)/2). Go back to (I).

Algorithm 4. The window version of the full Bayesian stochastic matching pursuit algorithm.

(I) Randomly select a window W; let A be the number of active bases within W, and let B be the number of inactive bases within W.

(II) With probability p_birth, go to (III). With probability p_death = 1 − p_birth, go to (V).

(III) With probability p_accept-birth calculated according to (6), go to (IV); with probability 1 − p_accept-birth, go back to (II).

(IV) Among all the inactive variables j ∈ W with γ_j = 0, sample a base j with probability proportional to z_j, then let γ_j = 1 and sample β_j and τ_j² as described in (III) of Algorithm 3. Go back to (II).

(V) If A > 0, then randomly select an active variable k with γ_k = 1 within W.

(VI) With probability p_accept-death calculated according to (7), accept the proposal of deleting the base k, i.e., set β_k = 0 and sample τ_k² ~ IG(1/2 + α, β_k²/2 + λ_τ); go back to (II). With probability 1 − p_accept-death, reject the proposal of deleting the variable k, and sample β_k ~ N(r_k, σ_{k*}²). Go back to (II). After the N_k-th iteration, go to (VII).

(VII) Sample λ_τ and ρ as described in (IV) of Algorithm 3. Go back to (I). After the M_k-th iteration, go to (VIII).

(VIII) Compute Res = Y − Σ_i β_i X_i, then sample σ² ~ IG((n + ν)/2, (Res'Res + νλ)/2). Go back to (I).
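
The additional draws that make the procedure fully Bayesian can be sketched as below. The conditional posteriors are those stated in Section 2.3 (written with p, the number of candidate variables); the function name, argument order, and the use of numpy's Gamma parameterization (shape and scale) are our own assumptions.

```python
import numpy as np

def sample_hyperparameters(beta, gamma, tau2, alpha, alpha_lam, beta_lam,
                           alpha_rho, beta_rho, rng):
    """Draw tau_i^2, lambda_tau and rho from their conditional posteriors
    (a sketch of steps (III)-(IV) of Algorithm 3 / (VII) of Algorithm 4)."""
    p = len(gamma)
    # lambda_tau | tau^2 ~ Gamma(p*alpha + alpha_lambda, rate = sum_i 1/tau_i^2 + beta_lambda)
    lam_tau = rng.gamma(p * alpha + alpha_lam, 1.0 / (np.sum(1.0 / tau2) + beta_lam))
    # tau_i^2 | gamma_i, beta_i, lambda_tau:  IG(alpha, lambda_tau) when gamma_i = 0,
    #                                         IG(1/2 + alpha, beta_i^2/2 + lambda_tau) when gamma_i = 1.
    shape = np.where(gamma == 1, 0.5 + alpha, alpha)
    scale = np.where(gamma == 1, 0.5 * beta**2 + lam_tau, lam_tau)
    tau2 = 1.0 / rng.gamma(shape, 1.0 / scale)        # inverse-Gamma via 1/Gamma
    # rho | gamma ~ Beta(sum_j gamma_j + alpha_rho, p - sum_j gamma_j + beta_rho)
    k = int(gamma.sum())
    rho = rng.beta(k + alpha_rho, p - k + beta_rho)
    return tau2, lam_tau, rho
```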

2.4 Efficient stochastic matching pursuit

In the birth process of the stochastic matching pursuit algorithm, we need to re-calculate all the z_j of the inactive variables j. In fact, when the birth proposal is rejected, the values of the z_j's do not need to be re-calculated. For the death process, however, no matter whether we delete an active variable or not, we must re-calculate the z_j's. Hence, to save the cost of computing the z_j's, we make the following modification in our code: in step (III) of Algorithm 2 and step (IV) of Algorithm 4, when no variable is added to the model in the birth process, we keep the values of the z_j's instead of re-calculating all of them. We believe this modification makes our code more efficient, especially once the posterior samples become stable and when the true model satisfies the sparsity assumption.

2.4.1 Computational cost

Here we compare the efficiency of this new version of the code with the original code implemented by Chen and Chu (2008) via three simulations. All procedures are implemented in MATLAB, and the codes are run on a PC with a 2.40GHz Intel Core 2 Quad CPU. We set (n, p) = (100, 200), (100, 300), and (100, 400). The regressors X1, X2, ..., Xp are i.i.d. N100(0, I100), and the response is generated by Y = 3X1 − 3.5X2 + 4X3 − 2.8X4 + 3.2X5 + ε, where ε ~ N100(0, I100). We run 10,000 iterations for all simulations, and Table 1 shows the CPU time taken by the two versions of the stochastic matching pursuit code. We use OSMP and NSMP to denote the results of these two versions.
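
For reference, simulated data of the kind used in this timing comparison can be generated as in the short numpy sketch below (the thesis' own code is in MATLAB; the seed is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
n, p = 100, 400                                 # also (100, 200) and (100, 300)
X = rng.standard_normal((n, p))                 # X_1, ..., X_p i.i.d. N_100(0, I_100)
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -3.5, 4.0, -2.8, 3.2]     # active coefficients
y = X @ beta_true + rng.standard_normal(n)      # Y = 3X1 - 3.5X2 + 4X3 - 2.8X4 + 3.2X5 + eps
```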

From Table 1, it is clear that the new version of the code reduces the computational cost: the CPU time is reduced by 408 seconds when p = 200, and by 1072 seconds when p = 400. From Figure 1, it is also clear that NSMP saves more cost when p is larger.

Table 1: CPU time of 10,000 iterations

                                   OSMP      NSMP      Reduction
CPU time for (n, p) = (100, 200)   4798s     4390s     408s
CPU time for (n, p) = (100, 300)   8358s     7809s     549s
CPU time for (n, p) = (100, 400)   14073s    13001s    1072s

Figure 1: CPU time (in seconds) of OSMP and NSMP against the number of variables p = 200, 300, 400.

2.5 Selection criterion

The goal of the Bayesian variable selection problem is to choose the best model based on the posterior samples. From the MCMC algorithms, we can generate a sequence of γ that converges to π(γ|Y), and then discard the first M unstable γ's. The remaining MCMC samples are used to compute the posterior probabilities of the possible models. The first selection criterion is to choose the model with the highest posterior probability, that is, to maximize P(γ1, γ2, ..., γp | Y). Here 2^p posterior probabilities need to be computed, so this criterion might not be appropriate when p is large. Another criterion is the median probability criterion proposed in Barbieri and Berger (2004), introduced as follows. The posterior inclusion probability for variable i is π_i = P(γ_i = 1 | Y). The median probability model, M_{γ*}, is then defined as the model consisting of those variables whose posterior inclusion probability is at least 1/2; that is, γ_i* = 1 if π_i ≥ 1/2 and γ_i* = 0 otherwise. In this thesis, we consider situations where p is large, so we use the median probability criterion to select the best model.
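
Applying the median probability criterion to the retained posterior samples amounts to a thresholded average of the indicator draws, as in the following sketch (the array layout of the stored samples is an assumption).

```python
import numpy as np

def median_probability_model(gamma_samples, burn_in):
    """Median probability model of Barbieri and Berger (2004).

    gamma_samples: (num_iterations, p) array of sampled indicator vectors."""
    kept = gamma_samples[burn_in:]            # discard the first M unstable draws
    pi = kept.mean(axis=0)                    # pi_i = P(gamma_i = 1 | Y)
    gamma_star = (pi >= 0.5).astype(int)      # include variable i iff pi_i >= 1/2
    return pi, gamma_star
```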

3 Denoising with the Unions of Orthogonal Bases

In this section, the denoising problem in signal processing is considered. We first describe the signal denoising problem in Févotte and Godsill (2006). Then we show how to apply the window version of the stochastic matching pursuit algorithm (SMP for short) to this denoising problem. Finally, three simulations and a real example are used to compare the results of our SMP algorithm with those of the algorithm in Févotte and Godsill (2006).

3.1 The denoising problem

The model for the denoising problem is written as equation (1), i.e.

Y = s + ε = Xβ + ε,

where Y is the noisy signal, s is the clean signal and can be represented as s = Xβ, and ε is additive Gaussian white noise with standard deviation σ. The goal of signal denoising is to remove the noise, which possibly has high variance, in order to estimate the clean signal. The modified discrete cosine transform (MDCT) is a popular choice for the basis of the signal model, since it provides an orthonormal decomposition without blocking effects. According to Malvar (1990) and Févotte et al. (2008), X_(q,m) = [Φ_(q,m)(1), Φ_(q,m)(2), ..., Φ_(q,m)(n)]' is a vector of the orthonormal MDCT basis X, where the corresponding atom Φ_(q,m)(t), t = 1, ..., n, for a signal of length n = l_frame × m_frame, is defined as

Φ_(q,m)(t) = √(4/n) h(t − (m − 1) l_frame) × cos[ (π / l_frame) (t − (m − 1) l_frame + (l_frame + 1)/2)(q − 1/2) ],

where q = 1, ..., l_frame, m = 1, ..., m_frame, and the window function is h(i) = sin[(π/n)(i + 1/2)]. Here we use k to replace (q, m), i.e. X_k = [Φ_k(1), Φ_k(2), ..., Φ_k(n)]', k = 1, ..., n. In Févotte and Godsill (2006), they studied the denoising problem of a piano sequence.

The basis set X is the union of two MDCT bases. That is, the signal model is

Y = X1 β1 + X2 β2 + ε = Σ_{k=1}^n β_{1,k} X_{1,k} + Σ_{k=1}^n β_{2,k} X_{2,k} + ε,   (8)

where X1 and X2 are n × n orthonormal matrices, and X_{i,k} is the k-th column of basis X_i. To choose these two orthonormal matrices X1 and X2, we usually take X1 with a long time resolution, l_frame, and a high frequency resolution, aimed at representing the tonal parts of the signal. X2 is constructed with a short time resolution and a low frequency resolution, and is aimed at representing the transient parts of the signal.

3.2 Bayesian algorithm

Févotte and Godsill (2006) studied the denoising problem via a Bayesian selection approach. Since the basis set is constructed from the two orthonormal basis sets X1 and X2, we define θ1 = {γ̃1, β̃1, τ̃1, λ_{τ1}, ρ1} for the parameters in the first block, and θ2 = {γ̃2, β̃2, τ̃2, λ_{τ2}, ρ2} for the parameters in the second block, where γ̃i = (γ_{i,1}, γ_{i,2}, ..., γ_{i,n}), β̃i = (β_{i,1}, β_{i,2}, ..., β_{i,n}), and τ̃i = (τ_{i,1}, τ_{i,2}, ..., τ_{i,n}), i = 1, 2. The prior assumptions are chosen as follows. First, a hierarchical prior is used for β_{i,k}, i = 1, 2, k = 1, ..., n:

β_{i,k} | γ_{i,k}, τ_{i,k}² ~ (1 − γ_{i,k}) δ0 + γ_{i,k} N(0, τ_{i,k}²),   τ_{i,k}² ~ IG(α_i, λ_{τ̃i}).

The scale parameter λ_{τ̃i} for each basis is given a Gamma prior, λ_{τ̃i} ~ Gamma(α_λ, β_λ). The prior of γ_{i,k} is set as a Bernoulli prior, i.e.

P(γ_{i,k} = 1) = ρ_i,   P(γ_{i,k} = 0) = 1 − P(γ_{i,k} = 1) = 1 − ρ_i.

The probability ρ_i is given a Beta prior, Beta(α_{ρi}, β_{ρi}). Finally, the variance σ² is given an inverse Gamma prior, σ² ~ IG(ν/2, νλ/2). These prior assumptions in Févotte and Godsill (2006) are the same as those in Section 2.3. They proposed to sample the parameters {θ1, θ2, σ²} from their posterior distributions using a Gibbs sampler.

Since the Euclidean norm is invariant under rotation, the likelihood function of Y is

p(Y | θ1, θ2, σ²) = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) ‖Y − X1 β̃1 − X2 β̃2‖²₂ }
                 = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) ‖X2' (Y − X1 β̃1) − β̃2‖²₂ }
                 = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) ‖X1' (Y − X2 β̃2) − β̃1‖²₂ }.

That is, conditional upon θ2 and σ², sampling (γ̃1, β̃1) is a simple regression problem; similarly, sampling (γ̃2, β̃2) is a simple regression problem when θ1 and σ² are fixed. Thus the Gibbs sampler draws (γ̃1, β̃1) and (γ̃2, β̃2) alternately. The details of this algorithm for sampling (γ̃i, β̃i) are as follows. Define Y_{i|−i} to be Y_{1|2} = Y − X2 β̃2 or Y_{2|1} = Y − X1 β̃1. First choose X_i, i = 1, 2, and randomly select a regressor X_{i,j} in X_i. Then compute R_{i|−i,j} = Y_{i|−i} − Σ_{k≠j} β_{i,k} X_{i,k} and calculate the likelihood ratio z_j with the data Y_{i|−i} as in step (II) of Algorithm 3. Then γ_{i,j} is drawn from its conditional posterior distribution, with

P(γ_{i,j} = 1 | {β_{i,k}, ∀k ≠ j}, Y_{i|−i}) = ρ_i z_j / ((1 − ρ_i) + ρ_i z_j).

If γ_{i,j} = 1, then we sample β_{i,j} ~ N(r_j, σ_{j*}²) and τ_{i,j}² ~ IG(1/2 + α_i, β_{i,j}²/2 + λ_{τ̃i}); otherwise, β_{i,j} is set to 0 and τ_{i,j}² ~ IG(α_i, λ_{τ̃i}) is sampled. After several iterations, the hyperparameters λ_{τ̃i} and ρ_i are updated from their posterior distributions, and then we move to the other basis. After visiting the two bases, σ² is updated from its posterior distribution. Thus, we can also treat the algorithm in Févotte and Godsill (2006) as a window version of the full Bayesian componentwise Gibbs sampler algorithm.

Instead of the algorithm in Févotte and Godsill (2006), we can use the window version of the full Bayesian stochastic matching pursuit algorithm for this problem by setting the X_i's as the pre-specified windows. Thus, for SMP, we choose one of the two bases as the window and implement the birth and death processes only within this window. If the window size is large, we might run the birth and death processes for the same window several times before moving to the next window.
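
The following sketch shows why each block update is "a simple regression problem": because X_i is orthonormal, the partial residual correlations reduce to the rotated data X_i'(Y − X_{−i} β̃_{−i}), so every pair (γ_{i,k}, β_{i,k}) can be drawn from that vector alone. The function and variable names are ours, not those of Févotte and Godsill (2006), and the per-coefficient variances τ_{i,k}² are assumed to be passed in as a vector.

```python
import numpy as np

def update_block(Xi, y_cond, tau2_i, sigma2, rho_i, rng):
    """Update one block (gamma_i, beta_i) given the other block's fit.

    Xi is an n x n orthonormal basis and y_cond = Y - X_other @ beta_other.
    Since Xi' Xi = I, each (gamma_{i,k}, beta_{i,k}) depends on the data only
    through c[k] = (Xi' y_cond)[k], so the whole block can be drawn at once."""
    c = Xi.T @ y_cond                              # rotated partial residual
    s2 = sigma2 * tau2_i / (tau2_i + sigma2)       # sigma_{k*}^2 with X_k'X_k = 1
    r = c * tau2_i / (tau2_i + sigma2)             # r_k
    z = np.sqrt(s2 / tau2_i) * np.exp(0.5 * r**2 / s2)
    prob = rho_i * z / ((1.0 - rho_i) + rho_i * z)
    gamma_i = (rng.random(len(c)) < prob).astype(int)
    beta_i = np.where(gamma_i == 1, rng.normal(r, np.sqrt(s2)), 0.0)
    return gamma_i, beta_i
```

A full sweep then alternates update_block for (X1, Y − X2 β̃2) and (X2, Y − X1 β̃1), with the hyperparameters and σ² refreshed as described above.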

3.3 Simulations

Here, WB CGS is used to denote the componentwise Gibbs sampler algorithm in Févotte and Godsill (2006), and WB SMP is used for the window version of the full Bayesian stochastic matching pursuit. Three simulations are first employed, based on three different models whose active variables are selected from one of the orthonormal basis sets or from both. For each simulation, we iterate WB CGS and WB SMP only 4000 times, because our code takes a long CPU time; for example, WB SMP with 4000 iterations takes about 13000 seconds on a PC with a 2.40GHz Intel Core 2 Quad CPU.

Example 3.1 We set the number of observations n to 1000, so that the number of regressor variables is p = 2000. Here X1 = [X_{1,1}, X_{1,2}, ..., X_{1,1000}] and X2 = [X_{2,1}, X_{2,2}, ..., X_{2,1000}] are two MDCT bases with two different time resolutions, l_frame1 = 500 and l_frame2 = 250, respectively. The response is generated by the model Y = X1 β1 + X2 β2 + ε, where ε is white Gaussian noise, and the active coefficients are chosen as

(β_{1,1}, β_{1,2}, β_{1,3}, β_{1,4}, β_{1,5}, β_{1,996}, β_{1,997}, β_{1,998}, β_{1,999}, β_{1,1000}) = (3, −2, 5.2, 3.2, 2.5, −2.8, 3.6, 4.2, 2.9, −5)

and

(β_{2,982}, β_{2,984}, β_{2,986}, β_{2,988}, β_{2,990}, β_{2,992}, β_{2,994}, β_{2,996}, β_{2,998}, β_{2,1000}) = (4, −2.2, 5, −4.2, 2.9, −3.8, −2.8, 3.2, 3, −3.5).

We add white Gaussian noise to the original response Y_true according to several input signal-to-noise ratios (SNRs). The input SNR is defined as SNR = 10 log10(‖Y_true‖² / ‖Y − Y_true‖²), so the smaller the input SNR, the more noise there is. In our simulations, the function "awgn" in MATLAB is used to add the corresponding noise according to a pre-specified SNR value. For each input SNR, we run 4000 iterations of WB SMP and discard the first 3000 iterations. In each iteration of the stochastic matching pursuit, the parameters λ and ρ are updated after every 10 variable updates, and σ² is sampled after WB SMP has visited every 100 variables. We then use the last 1000 iterations to compute the posterior probability. The simulation results with 10 replications are shown in Table 2. In Table 2, the output SNR, R_z², and R² are defined as

SNR_out = 10 log10(‖Y_true‖² / ‖Ŷ − Y_true‖²),
R_z² = 1 − (‖Ŷ − Y_true‖² / ‖Y_true‖²),   and   R² = 1 − (‖Ŷ − Y‖² / ‖Y‖²),

where Ŷ = X_{1,in} β̂_{1,in} + X_{2,in} β̂_{2,in}; X_{i,in}, i = 1, 2, is the set of the X_{i,j} selected in the model, and β̂_{i,in} is the posterior mean of β_{i,in}. "Mean" is the mean of the 10 replications and "Std" is the standard deviation of the 10 replications.
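
The noise injection and the three performance measures can be computed as in the sketch below; add_noise_at_snr is only a simplified stand-in for MATLAB's awgn that matches the input-SNR definition given above, and all names are ours.

```python
import numpy as np

def add_noise_at_snr(y_true, snr_db, rng):
    """Add white Gaussian noise so that 10*log10(||y_true||^2 / ||noise||^2) = snr_db."""
    noise = rng.standard_normal(len(y_true))
    scale = np.linalg.norm(y_true) / (np.linalg.norm(noise) * 10.0 ** (snr_db / 20.0))
    return y_true + scale * noise

def denoising_scores(y_true, y_obs, y_hat):
    """Output SNR, R_z^2 and R^2 as defined for Table 2."""
    snr_out = 10.0 * np.log10(np.sum(y_true**2) / np.sum((y_hat - y_true)**2))
    rz2 = 1.0 - np.sum((y_hat - y_true)**2) / np.sum(y_true**2)
    r2 = 1.0 - np.sum((y_hat - y_obs)**2) / np.sum(y_obs**2)
    return snr_out, rz2, r2
```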

Table 2: Simulation results in Example 3.1, Mean (Std) of 10 replications

        Input SNR        0                    5                    10                   15                   20
WB CGS  SNR_out    10.75491 (1.19535)   19.82738 (1.94599)   25.23381 (1.75802)   30.311135 (1.71331)  32.53475 (4.87849)
        R_z²       0.91302 (0.02423)    0.98876 (0.00411)    0.99678 (0.00109)    0.99901 (0.00038)    0.99839 (0.00379)
        R²         0.5343 (0.05900)     0.77404 (0.00932)    0.91071 (0.00647)    0.96899 (0.00183)    0.98908 (0.00360)
WB SMP  SNR_out    10.65114 (1.66694)   20.03369 (2.73985)   25.26903 (2.37544)   31.18720 (1.64604)   33.81311 (2.60453)
        R_z²       0.90782 (0.03787)    0.98813 (0.00784)    0.99658 (0.00209)    0.99921 (0.00028)    0.9995 (0.00033)
        R²         0.50423 (0.02883)    0.77331 (0.00887)    0.91051 (0.00637)    0.96894 (0.00188)    0.99017 (0.00055)

Note: "WB CGS" denotes the algorithm in Févotte and Godsill (2006) and "WB SMP" denotes the window version of the full Bayesian stochastic matching pursuit. "Mean" is the mean of 10 replications and "Std" is the standard deviation of 10 replications.

In Table 2, a larger output SNR means the estimated signal Ŷ is closer to the clean signal. Table 2 shows that when the input SNR increases, the output SNR also increases; that is, when less noise is added to the signal, the estimated signal is better. The standard deviations in Table 2 are between 1 and 3, except that the standard deviation of WB CGS at input SNR 20 is larger than the others, since one of the 10 replications selects too many variables. Comparing WB SMP with WB CGS, the output SNR of WB SMP is larger than that of WB CGS except when the input SNR is 0. However, the differences in output SNR and R_z² between WB SMP and WB CGS are small in all situations. In this example, both results are similar.

Example 3.2 This example considers the case where the active variables are all in the second basis. We set (n, p) = (1000, 2000), and the two MDCT bases follow the same setup as in Example 3.1. We generate the response by Y = X1 β1 + X2 β2 + ε, where ε is white Gaussian noise, and the active coefficients are chosen as

(β_{2,982}, β_{2,984}, β_{2,986}, β_{2,988}, β_{2,990}, β_{2,992}, β_{2,994}, β_{2,996}, β_{2,998}, β_{2,1000}) = (4, −2.2, 5, −4.2, 2.9, −3.8, −2.8, 3.2, 3, −3.5).

With the same settings for WB SMP as in Example 3.1, we iterate 4000 times for posterior sampling and keep the last 1000 iterations for computing the posterior probability. Table 3 shows the simulation results with 10 replications for Example 3.2.

Table 3: Simulation results in Example 3.2, Mean (Std) of 10 replications

        Input SNR        0                    5                    10                   15                   20
WB CGS  SNR_out    19.28007 (3.33625)   24.16099 (3.85708)   29.51456 (2.95538)   34.98641 (1.78472)   32.70604 (2.40116)
        R_z²       0.98487 (0.01132)    0.99459 (0.00484)    0.99862 (0.00099)    0.99967 (0.00011)    0.99938 (0.00040)
        R²         0.51044 (0.02154)    0.76611 (0.00876)    0.90951 (0.00467)    0.96996 (0.00154)    0.98988 (0.00043)
WB SMP  SNR_out    18.91424 (3.42138)   24.50630 (3.42028)   29.59626 (2.53380)   35.04614 (1.82185)   37.68110 (3.12685)
        R_z²       0.98358 (0.01131)    0.99541 (0.00323)    0.99874 (0.00068)    0.99966 (0.00013)    0.99974 (0.00011)
        R²         0.50992 (0.02168)    0.76495 (0.00898)    0.90824 (0.00253)    0.96996 (0.00154)    0.99008 (0.00040)

Note: "WB CGS" denotes the algorithm in Févotte and Godsill (2006) and "WB SMP" denotes the window version of the full Bayesian stochastic matching pursuit. "Mean" is the mean of 10 replications and "Std" is the standard deviation of 10 replications.

From Table 3, we first find that for WB CGS the output SNR at input SNR 20 (32.70604) is smaller than the output SNR at input SNR 15 (34.98641).

In general, the less noise added to the model (the higher the input SNR), the larger the output SNR should be. This situation at input SNR 20 is due to overfitting, since most replications select too many variables into the model. Comparing WB CGS and WB SMP, the output SNR of WB SMP is larger than that of WB CGS except when the input SNR is 0, and the standard deviations are close. WB SMP is better than WB CGS in this example.

Example 3.3 In this example, we consider the case where the active variables are all in the first basis; (n, p) is also set to (1000, 2000). The two bases X1 and X2 follow the same setup as in Example 3.1, and the response is generated by the model Y = X1 β1 + X2 β2 + ε, where ε is white Gaussian noise, and the active coefficients are chosen as

(β_{1,1}, β_{1,2}, β_{1,3}, β_{1,4}, β_{1,5}, β_{1,996}, β_{1,997}, β_{1,998}, β_{1,999}, β_{1,1000}) = (3, −2, 5.2, 3.2, 2.5, −2.8, 3.6, 4.2, 2.9, −5).

The WB SMP algorithm is run in the same way as in Example 3.1: we run 4000 iterations of WB SMP, and the remaining 1000 iterations are used to compute the posterior probability. The simulation results with 10 replications are shown in Table 4.

From Table 4, when the input SNR increases, the output SNR and R_z² also increase for both algorithms. Comparing WB SMP with WB CGS, the output SNR of WB SMP is larger than that of WB CGS when the input SNR is 0, 5, or 10, but the output SNRs of the two algorithms are close in those cases. For the other input SNRs, the difference in output SNR between WB SMP and WB CGS is larger than 1 and the output SNR of WB CGS is larger than that of WB SMP. The standard deviations of WB SMP are larger than those of WB CGS, so WB CGS is more stable than WB SMP. Therefore WB CGS is better than WB SMP in this simulation.

Table 4: Simulation results in Example 3.3, Mean (Std) of 10 replications

        Input SNR        0                    5                    10                   15                   20
WB CGS  SNR_out    16.54152 (2.51481)   23.77545 (2.75727)   28.12531 (3.13804)   35.69682 (3.14162)   40.50474 (1.86400)
        R_z²       0.97444 (0.01418)    0.99496 (0.00339)    0.99814 (0.00103)    0.99966 (0.00027)    0.99990 (0.00005)
        R²         0.52173 (0.02973)    0.76026 (0.01046)    0.91106 (0.00428)    0.96931 (0.00135)    0.99009 (0.00038)
WB SMP  SNR_out    17.362125 (1.86323)  23.80016 (2.73346)   28.149815 (3.04860)  34.265165 (3.77314)  38.51628 (2.76711)
        R_z²       0.97984 (0.01041)    0.99501 (0.00332)    0.99817 (0.00100)    0.99950 (0.00038)    0.99983 (0.00009)
        R²         0.51091 (0.01662)    0.76009 (0.01008)    0.91090 (0.00447)    0.96918 (0.00130)    0.98998 (0.00026)

Note: "WB CGS" denotes the algorithm in Févotte and Godsill (2006) and "WB SMP" denotes the window version of the full Bayesian stochastic matching pursuit. "Mean" is the mean of 10 replications and "Std" is the standard deviation of 10 replications.

3.4 A real example

In this subsection, we present a real example of the denoising problem for a piano sequence. We extract a melodic piano sequence used in Févotte and Godsill (2006) of length N = 5000. The two MDCT bases X1 and X2 are used here: X1 = [X_{1,1}, X_{1,2}, ..., X_{1,5000}] is the MDCT basis with l_frame1 = 2500 and X2 = [X_{2,1}, X_{2,2}, ..., X_{2,5000}] is the MDCT basis with l_frame2 = 1250. We add white Gaussian noise by the function "awgn" in MATLAB, which adds the corresponding noise according to pre-specified SNR values. We center the response Y before applying the algorithms; WB CGS and WB SMP are then performed as follows. For each input SNR, we run 1000 iterations and discard the first 700 unstable iterations. The reason for such few iterations of the MCMC algorithms is that 1000 iterations take a long time (around 450000 seconds) on a PC with a 2.67GHz Intel Core i7 CPU.

In each iteration, WB CGS uses systematic scans. For WB SMP, we update the parameters λ and ρ after randomly scanning 10 variables, and then we update σ² after repeating the above step 100 times, so that 1000 variables are visited in each iteration. Finally, we use the last 300 iterations to compute the posterior probability. The results of both algorithms are shown in Table 5.

Table 5: Results for the real example

        Input SNR     0        5        10       15       20
WB CGS  SNR_out    13.1818  15.6261  19.9348  22.7051  27.9728
        R_z²       0.9519   0.9726   0.9898   0.9946   0.9984
        R²         0.507    0.7557   0.907    0.9669   0.9896
WB SMP  SNR_out    13.4351  16.6892  19.8517  22.0753  26.569
        R_z²       0.9547   0.9786   0.9896   0.9938   0.9978
        R²         0.5033   0.7506   0.9047   0.9651   0.9889

Note: "WB CGS" denotes the algorithm in Févotte and Godsill (2006) and "WB SMP" denotes the window version of the full Bayesian stochastic matching pursuit.

In Table 5, the R_z² of the two algorithms is larger than 95% in all situations, which means the estimated signal is close to the clean signal; both algorithms perform well in this real example. From the output SNR in Table 5, WB SMP is larger than WB CGS when the input SNR is 0 or 5, and WB CGS is larger than WB SMP in the other situations. Therefore, neither algorithm is uniformly better under these five situations. Since the differences in output SNR between WB CGS and WB SMP are small, WB SMP performs as well as WB CGS.

4 Conclusion

In this thesis, we modify the two Bayesian variable selection procedures in Chen et al. (2009) to be full Bayesian procedures. Since the parameters ρ and τ² are fixed in Chen et al. (2009), we sample these two parameters in our procedures so that the selection results are not affected by their fixed values. Moreover, we also suggest a coding approach to increase the efficiency of the stochastic matching pursuit algorithm. From Table 1 and Figure 1, our new approach saves computational cost, especially when p is large.

We apply our full Bayesian algorithm to denoising problems by following the basis structure, the union of two MDCT bases, in Févotte and Godsill (2006). From our simulation results, the two window versions work well, since all values of R_z² are larger than 90%. Comparing WB SMP with WB CGS, the two algorithms perform similarly, and in the denoising example of a piano sequence the results of WB SMP are as good as those of WB CGS. Finally, the window version of the stochastic matching pursuit algorithm is easily extended to more than two orthonormal bases by treating them as individual windows.

Acknowledgement

We are grateful to the National Center for High-performance Computing for computer time and facilities.

References

[1] Barbieri, M., and Berger, J. O. (2004). Optimal predictive model selection, Annals of Statistics, 32, 870-897.

[2] Chen, R.-B., and Chu, C.-H. (2008). Bayesian variable selection for large p small n problems, M.S. Thesis, Institute of Statistics, National University of Kaohsiung, Kaohsiung, Taiwan.

[3] Chen, R.-B., Chu, C.-H., Lai, T.-Y., and Wu, Y. N. (2009). Stochastic matching pursuit for Bayesian variable selection, Statistics and Computing.

[4] Févotte, C., and Godsill, S. J. (2006). Sparse linear regression in unions of bases via Bayesian variable selection, IEEE Signal Processing Letters, 13, 441-444.

[5] Févotte, C., Torrésani, B., Daudet, L., and Godsill, S. J. (2008). Sparse linear regression with structured priors and application to denoising of musical audio, IEEE Transactions on Audio, Speech, and Language Processing, 16, 174-185.

[6] Geweke, J. (1996). Variable selection and model comparison in regression. In: Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M. (eds.) Bayesian Statistics, 5, 609-620.

[7] Mallat, S. G., and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries, IEEE Transactions on Signal Processing, 41, 3397-3415.

[8] Malvar, H. S. (1990). Lapped transforms for efficient transform/subband coding, IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 969-978.
