AN OVERVIEW OF APPROXIMATE BAYESIAN INFERENCE IN MACHINE LEARNING

Chiu-Hsing Weng

Department of Statistics, National Chengchi University

ABSTRACT

Machine learning is a scientific discipline concerned with designing algorithms that automatically learn complex patterns from data. Many machine learning methods rely on probabilistic models and treat them in a Bayesian framework. Sampling methods such as Markov chain Monte Carlo are popular and well known for approximate Bayesian inference. Alternatively, there are deterministic approximation techniques that have been successful in many applications. We present here some of the concepts and developments of deterministic approximation methods.

Key words and phrases: Approximate Bayesian inference; machine learning. JEL classification: C110.

1. Introduction

Machine learning is a scientific discipline concerned with designing algorithms that automatically learn complex patterns from data. It integrates many approaches, such as probability, optimization, statistics, and control theory. Its applications include forecasting, language processing, pattern recognition, games, text mining, and data mining.

Many machine learning methods rely on probabilistic models for variables. In a Bayesian framework, both the observed data D and the unknown model parameters θ are considered random quantities. Let p(θ) denote the prior distribution and p(D|θ) the likelihood. After observing D, Bayes' theorem gives the posterior distribution of θ given D:

$$ p(\theta \mid D) = \frac{p(\theta, D)}{p(D)} = \frac{p(\theta, D)}{\int p(\theta, D)\, d\theta}. $$

The posterior distribution of θ is useful for estimation and prediction. For example, quantities such as posterior quantiles and posterior moments are expectations of some function g(θ) with respect to the posterior distribution:

$$ \mathrm{E}[g(\theta) \mid D] = \int g(\theta)\, p(\theta \mid D)\, d\theta; $$

and the posterior predictive distribution of a future observation y can be expressed as

$$ p(y \mid D) = \int p(y \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int p(y \mid \theta)\, p(\theta \mid D)\, d\theta, $$

where the second equality holds provided the new data point y and the observed data D are independent given θ. The marginal density of the data, p(D) = ∫ p(D, θ) dθ, also called the evidence or model evidence, is useful for model selection. Both p(θ|D) and p(D) play important roles in Bayesian inference. For some models the integrals involved are computationally tractable, for instance normal data with conjugate priors. However, for many models of interest the integrals are intractable and approximations are required.
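To make the tractable case concrete, the following is a minimal sketch of the conjugate normal model, assuming observations x_i ~ N(θ, noise_var) with known noise variance and a prior θ ~ N(prior_mean, prior_var). The function names and the numbers are illustrative assumptions, not taken from the paper.

```python
import math

def normal_posterior(data, prior_mean, prior_var, noise_var):
    """Posterior N(mean, var) of theta for x_i ~ N(theta, noise_var)
    under the prior theta ~ N(prior_mean, prior_var)."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var

def predictive(post_mean, post_var, noise_var):
    """Posterior predictive p(y | D) of a future observation: a normal
    distribution whose variance adds the observation noise."""
    return post_mean, post_var + noise_var

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_evidence(data, prior_mean, prior_var, noise_var):
    """log p(D), accumulated as a sum of one-step-ahead predictive log densities."""
    m, v, total = prior_mean, prior_var, 0.0
    for x in data:
        total += log_normal_pdf(x, m, v + noise_var)
        m, v = normal_posterior([x], m, v, noise_var)
    return total

# Hypothetical data, used only for illustration.
data = [1.2, 0.8, 1.5]
post_mean, post_var = normal_posterior(data, 0.0, 1.0, 1.0)
print(post_mean, post_var, log_evidence(data, 0.0, 1.0, 1.0))
```

Here the posterior, the predictive distribution, and the evidence all come out in closed form; the approximation methods reviewed in this paper target models where no such formulas exist.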


Approximate inference in Bayesian analysis can be divided into two categories: deterministic approaches and nondeterministic approaches. Nondeterministic approaches are sampling-based approaches such as Markov chain Monte Carlo (MCMC) methods. These methods originated in chemistry and physics and are well known in several scientific disciplines. In the past decades MCMC methods have revolutionized statistical computing, and recent developments in MCMC algorithms and software have facilitated the use of Bayesian inference for complicated models. The advantages of MCMC methods are that they come with theoretical guarantees of convergence and that they often work well even for complex models. The disadvantages are that, for some applications, MCMC can be computationally costly and convergence can be difficult to check. In particular, if data arrive sequentially and real-time estimation is required, MCMC methods can be computationally infeasible because the whole sampling procedure has to be restarted whenever new data arrive. Some sequential Monte Carlo sampling methods, also referred to as particle filters, have been proposed to obtain more efficient algorithms for certain models (Handschin and Mayne [11], Liu and Chen [19], Doucet et al. [6], Liu and West [20], and references therein), but MCMC is still generally slow because it needs to draw tens of thousands of samples.

Deterministic approximate inference methods are alternatives to Monte Carlo methods. Though they may be less accurate, they are often much more efficient and perform remarkably well in some cases. The simplest and most widely used deterministic approach is the Laplace method. Let x be a p-dimensional vector, f a twice differentiable real-valued function defined on R^p, and n a large number. The Laplace method gives the following approximation for integrals:

$$ \int e^{n f(x)}\, dx \approx \left( \frac{2\pi}{n} \right)^{p/2} \left| -\nabla^2 f(x_0) \right|^{-1/2} e^{n f(x_0)}, $$

where x_0 is the unique mode of f and | · | denotes the determinant of a matrix. This method has been applied in Bayesian statistics and machine learning; see, for example, Tierney and Kadane [29], Kass and Raftery [16], and Williams and Barber [32]. Recently, Rue et al. [26] proposed integrated nested Laplace approximations as an alternative to MCMC for latent Gaussian models.
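The quality of the Laplace approximation can be illustrated with a one-dimensional example chosen here for convenience (it is not from the paper): f(x) = log x − x on x > 0, for which the integral of exp(n f(x)) equals Γ(n + 1) / n^(n+1) exactly. The mode is x_0 = 1 with f(x_0) = −1 and f''(x_0) = −1, so the formula with p = 1 reduces to sqrt(2π/n) e^(−n), which is Stirling's approximation. The restriction of f to the positive half-line is an assumption of this sketch.

```python
import math

def log_laplace(n):
    """Log of the Laplace approximation to the integral of exp(n * f(x)),
    with f(x) = log(x) - x, mode x0 = 1, f(x0) = -1, f''(x0) = -1."""
    return 0.5 * math.log(2.0 * math.pi / n) - n

def log_exact(n):
    """Log of the exact integral, Gamma(n + 1) / n**(n + 1)."""
    return math.lgamma(n + 1) - (n + 1) * math.log(n)

def ratio(n):
    """Exact value divided by the Laplace approximation."""
    return math.exp(log_exact(n) - log_laplace(n))

for n in (5, 50, 500):
    print(n, ratio(n))
```

The ratio approaches 1 as n grows, consistent with the requirement that n be large.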

Other popular deterministic approaches include variational Bayes (Jordan et al. [15], Attias [2], Bishop [3]) and expectation propagation (Minka [23]), among others. These deterministic methods have been widely used by computer scientists for machine learning, but have little presence in statistics. The present paper aims to provide an overview of these deterministic approaches, with special emphasis on a new deterministic approximation proposed in Weng and Lin [31]. The paper is organized as follows. Section 2 provides an overview of variational Bayes approaches. Section 3 reviews expectation propagation. Section 4 presents a moment matching method. Section 5 gives some discussion.

2. Variational Bayes

Variational Bayes methods are probably the most prominent and widely used deterministic approximate inference methods in machine learning. Recently they have attracted some attention in the statistical community. For example, Hall et al. [10] derived the asymptotic distributional behavior of Gaussian variational approximation; Faes et al. [7] demonstrated that, for regression with missing data, variational Bayes produces solutions of comparable accuracy to MCMC at much greater speed; and Ormerod and Wand [25] devised an effective variational approximation strategy for fitting generalized linear mixed models.

Variational Bayes methods are a family of Bayesian machine learning techniques that approximate intractable integrals using the notion of minimum Kullback-Leibler divergence and product assumptions on the posterior density. They can be used to lower bound the marginal likelihood and to provide an analytical approximation to the posterior distribution of the unobserved variables.

Let D denote the observed data and Z a set of unobserved variables. Here Z may consist of both model parameters and latent variables. Suppose that we are interested in the posterior distribution p(Z|D) and the marginal likelihood of the data p(D). Let p(D, Z) be the joint distribution of (D, Z) and q(Z) be any distribution of Z. Then

$$ p(D) = \int p(D, Z)\, dZ = \int q(Z)\, \frac{p(D, Z)}{q(Z)}\, dZ, $$

and by Jensen's inequality, we have

$$ \log p(D) \geq \int q(Z) \log \frac{p(D, Z)}{q(Z)}\, dZ. \tag{1} $$

It is desirable to maximize the right hand side of (1) with respect to q(Z) to get the tightest lower bound on log p(D). The gap between the two sides of (1) can be computed as

$$ \begin{aligned} \log p(D) - \int q(Z) \log \frac{p(D, Z)}{q(Z)}\, dZ &= \int q(Z) \log p(D)\, dZ - \int q(Z) \log \frac{p(D, Z)}{q(Z)}\, dZ \\ &= \int q(Z) \log \frac{q(Z)}{p(Z \mid D)}\, dZ = \mathrm{KL}(q(Z) \,\|\, p(Z \mid D)), \end{aligned} \tag{2} $$

which is the KL divergence between q(Z) and p(Z|D). Therefore,

$$ \log p(D) = \int q(Z) \log \frac{p(D, Z)}{q(Z)}\, dZ + \mathrm{KL}(q(Z) \,\|\, p(Z \mid D)), $$

and the maximization problem

$$ \max_q \int q(Z) \log \frac{p(D, Z)}{q(Z)}\, dZ $$

is equivalent to minimizing the KL divergence KL(q(Z)∥p(Z|D)). Clearly, the minimum KL(q(Z)∥p(Z|D)) = 0 is achieved at q(Z) = p(Z|D). However, as p(Z|D) is often intractable, variational Bayes seeks to minimize KL(q(Z)∥p(Z|D)) under the product assumption

$$ q(Z) = \prod_{i=1}^{I} q_i(Z_i), $$

where {Z_1, ..., Z_I} is a partition of Z. Under this restriction, the optimal solution can be shown to satisfy the equations

$$ \log q_i^*(Z_i) = \mathrm{E}_{-Z_i}[\log p(D, Z)] + \text{const}, \quad i = 1, \ldots, I, \tag{3} $$

where E_{−Z_i} denotes expectation with respect to the density ∏_{j: j≠i} q_j(Z_j); readers are referred to Bishop [3] for detailed derivations. These equations say that the logarithm of the optimal solution for factor q_i is the expectation of the logarithm of the joint distribution (of both observed and unobserved variables) with respect to all of the other factors.
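The decomposition of log p(D) into the lower bound plus a KL divergence can be checked numerically on a toy problem. The sketch below assumes a latent variable Z with three states and made-up values of the joint p(D, Z = k) for the observed D; the names `elbo` and `kl` are illustrative.

```python
import math

# Hypothetical values of the joint p(D, Z = k) for the observed D, k = 0, 1, 2.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                         # p(D) = sum over k of p(D, Z = k)
posterior = [pk / evidence for pk in joint]   # p(Z = k | D)

def elbo(q):
    """Lower bound on log p(D): sum over k of q_k * log(p(D, Z = k) / q_k)."""
    return sum(qk * math.log(pk / qk) for qk, pk in zip(q, joint) if qk > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

q = [0.5, 0.3, 0.2]                           # an arbitrary trial distribution
gap = math.log(evidence) - elbo(q)
print(gap, kl(q, posterior))                  # the two numbers agree
print(elbo(posterior), math.log(evidence))    # the bound is tight at q = p(Z | D)
```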

Since the optimal solution q_i* depends on the other factors q_j, j ≠ i, the solution is obtained iteratively: first initialize q_i(Z_i) for i = 1, ..., I, and then update each q_i(Z_i) in turn by (3) using the current estimates of all the other factors. Convergence of the algorithm is guaranteed under mild conditions; see Boyd and Vandenberghe [4] and Luenberger and Ye [21]. Once it has converged, one has an analytical approximation q* to the posterior distribution p(Z|D) and a lower bound for the evidence p(D). Overall, this method is fast to compute, but its disadvantage is that the approximation can carry some systematic error.
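As a minimal sketch of the iterative scheme based on (3), consider a toy target in place of p(Z|D): a bivariate Gaussian with known mean and covariance, the standard textbook example discussed in Bishop [3]. This choice is an assumption made here so that the exact answer is available for comparison. With precision matrix Λ, applying (3) gives Gaussian factors q_i with variance 1/Λ_ii and means that are updated in turn; the function name and inputs are illustrative.

```python
def mean_field_gaussian(mu, cov, n_iter=50):
    """Coordinate updates from (3) for a factorized q(z1) q(z2) approximating
    a bivariate Gaussian N(mu, cov). Returns factor means and variances."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Entries of the precision matrix Lambda = cov^{-1} (2 x 2 case).
    l11, l12, l22 = s22 / det, -s12 / det, s11 / det
    m1, m2 = 0.0, 0.0                     # arbitrary initialization
    for _ in range(n_iter):
        # log q1*(z1) is quadratic in z1: precision l11, mean depending on E[z2].
        m1 = mu[0] - l12 * (m2 - mu[1]) / l11
        # Same update for the second factor, using the fresh value of m1.
        m2 = mu[1] - l12 * (m1 - mu[0]) / l22
    return (m1, m2), (1.0 / l11, 1.0 / l22)

means, variances = mean_field_gaussian((1.0, -1.0), ((1.0, 0.8), (0.8, 1.0)))
print(means)      # converges to the true mean (1, -1)
print(variances)  # about 0.36 each, below the true marginal variance 1.0
```

In this example the means are recovered but the factor variances are too small, which illustrates the kind of systematic error mentioned above.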

3. Expectation Propagation

The expectation propagation (EP) algorithm (Minka [23]) is an iterative approach to approximating posterior distributions. It is an extension of assumed-density filtering (ADF), also called moment matching, which has been proposed in different scientific areas; see, for example, Maybeck [22] and Lauritzen [18]. Let D = {x_1, ..., x_n} and p(θ) be the prior distribution. Then the joint distribution of (D, θ) and the posterior distribution of θ are

$$ p(D, \theta) = p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \quad \text{and} \quad p(\theta \mid D) \propto p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta). \tag{4} $$

Assumed-density filtering starts by writing each term on the right hand side of (4) as t_i(θ); that is, t_1(θ) = p(θ) and t_i(θ) = p(x_{i−1}|θ) for i = 2, ..., n + 1. This gives

$$ p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) = \prod_{i=1}^{n+1} t_i(\theta). \tag{5} $$

Next, choose a suitable parametric distribution q(θ). Then sequentially incorporate each t_i, updating from an old q(θ) to a new q(θ). The update step is achieved by approximately minimizing the Kullback-Leibler divergence between the exact posterior distribution

$$ p(\theta) = \frac{t_i(\theta)\, q^{\mathrm{old}}(\theta)}{\int t_i(\theta)\, q^{\mathrm{old}}(\theta)\, d\theta} $$

and its approximation q:

$$ \min_q \mathrm{KL}(p \,\|\, q) = \min_q \int p(\theta) \log \left[ \frac{p(\theta)}{q(\theta)} \right] d\theta. \tag{6} $$

Note that here p is the exact distribution, so the arguments appear in the reverse order to the Kullback-Leibler divergence (2) used in variational inference. With the constraint that q is in the Gaussian family, the solution follows from moment matching, or expectation constraints:

$$ \mathrm{E}_p(\theta) = \mathrm{E}_q(\theta) \quad \text{and} \quad \mathrm{E}_p(\theta^2) = \mathrm{E}_q(\theta^2). $$

For other members of the exponential family, there will be different expectation constraints.
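A single ADF update can be sketched for one concrete choice of term, assumed here for illustration: a probit term t(θ) = Φ(aθ) combined with a Gaussian q_old = N(m, v). For this pair the first two moments of the tilted distribution t(θ) q_old(θ) are available in closed form, and the sketch checks them against a simple Riemann sum. The function names, the term, and the numbers are assumptions, not the paper's.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def match_moments_probit(m, v, a=1.0):
    """Mean and variance of p(theta) proportional to Phi(a*theta) * N(theta | m, v).
    These define the Gaussian q minimizing KL(p || q)."""
    s = math.sqrt(1.0 + a * a * v)
    z = a * m / s
    r = phi(z) / Phi(z)
    new_mean = m + a * v * r / s
    new_var = v - (a * v / s) ** 2 * r * (z + r)
    return new_mean, new_var

def moments_by_quadrature(m, v, a=1.0, n=4001, width=10.0):
    """The same two moments from a Riemann sum on m +/- width standard deviations."""
    sd = math.sqrt(v)
    lo = m - width * sd
    h = 2.0 * width * sd / (n - 1)
    z0 = z1 = z2 = 0.0
    for k in range(n):
        t = lo + k * h
        w = Phi(a * t) * math.exp(-0.5 * (t - m) ** 2 / v)
        z0 += w
        z1 += w * t
        z2 += w * t * t
    mean = z1 / z0
    return mean, z2 / z0 - mean * mean

print(match_moments_probit(0.5, 2.0))
print(moments_by_quadrature(0.5, 2.0))
```

When no closed form exists for the tilted moments, the quadrature routine (or another numerical integrator) can play the role of the moment computation.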

Clearly, the result of ADF depends on the ordering of the t_i. ADF is appropriate for online or real-time learning, where data points arrive sequentially and each data point must be discarded after learning from it. However, in an offline (or batch) setting one can re-use the data points to improve the ADF approximation, and this is the underlying idea of the EP algorithm. EP approximates each t_i in (5) by a term t̃_i, which is assumed to come from a parametric family (e.g. the exponential family), and approximates the posterior distribution p(θ|D) by q of the form

$$ q(\theta) \propto \prod_i \tilde{t}_i(\theta). $$

Specifically, it starts by initializing the term approximations t̃_i, and then cycles through the terms to refine them one by one. To refine the term t̃_i, first remove it from the current approximation to get ∏_{j: j≠i} t̃_j, and then determine the revised form t̃_i^new such that

$$ q^{\mathrm{new}}(\theta) \propto \tilde{t}_i^{\mathrm{new}} \prod_{j: j \neq i} \tilde{t}_j \quad \text{is as close as possible to} \quad t_i \prod_{j: j \neq i} \tilde{t}_j, $$

where the latter step is achieved by minimizing the Kullback-Leibler divergence (6):

$$ \tilde{t}_i^{\mathrm{new}} = \arg\min_t \mathrm{KL}\Big( t_i \prod_{j: j \neq i} \tilde{t}_j \,\Big\|\, t \prod_{j: j \neq i} \tilde{t}_j \Big). $$

Upon convergence, q^new is an approximation to the posterior distribution, and the normalizing constant of q^new is used to approximate the model evidence p(D):

$$ p(D) \approx \int \prod_i \tilde{t}_i(\theta)\, d\theta. $$

One problem with the EP algorithm is that it does not always converge. Minka [23] proposed some remedies for the convergence problem in certain applications, and there are other studies of the convergence issue; see, for example, Heskes and Zoeter [13]. The EP algorithm has been applied successfully in many applications, including the TrueSkill™ system [12], the online ranking system for Xbox Live developed at Microsoft Research.
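The EP cycle of removing a term, matching moments, and updating the term can be sketched for a scalar θ. The sketch assumes a prior N(0, prior_var) kept exact and probit terms t_i(θ) = Φ(a_i θ) approximated by unnormalized Gaussians stored in natural parameters; the model, the coefficients a_i, and all names are illustrative assumptions, and no damping or other convergence safeguard is included.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tilted_moments(m, v, a):
    """Mean and variance of the density proportional to Phi(a*theta) * N(theta | m, v)."""
    s = math.sqrt(1.0 + a * a * v)
    z = a * m / s
    r = phi(z) / Phi(z)
    return m + a * v * r / s, v - (a * v / s) ** 2 * r * (z + r)

def ep_probit(a_list, prior_var=1.0, n_sweeps=20):
    """EP with Gaussian term approximations; returns the mean and variance of q."""
    tau_site = [0.0] * len(a_list)      # precisions of the approximate terms
    nu_site = [0.0] * len(a_list)       # precision times mean of the terms
    tau, nu = 1.0 / prior_var, 0.0      # natural parameters of q (prior only)
    for _ in range(n_sweeps):
        for i, a in enumerate(a_list):
            # Remove term i from q to obtain the cavity distribution.
            tau_cav, nu_cav = tau - tau_site[i], nu - nu_site[i]
            # Match moments of (exact term i) x (cavity).
            m_new, v_new = tilted_moments(nu_cav / tau_cav, 1.0 / tau_cav, a)
            # Revised term = matched Gaussian divided by the cavity.
            tau_site[i] = 1.0 / v_new - tau_cav
            nu_site[i] = m_new / v_new - nu_cav
            tau, nu = tau_cav + tau_site[i], nu_cav + nu_site[i]
    return nu / tau, 1.0 / tau

def posterior_by_quadrature(a_list, prior_var=1.0, n=4001, lim=8.0):
    """Reference posterior mean and variance from a Riemann sum on [-lim, lim]."""
    h = 2.0 * lim / (n - 1)
    z0 = z1 = z2 = 0.0
    for k in range(n):
        t = -lim + k * h
        w = math.exp(-0.5 * t * t / prior_var)
        for a in a_list:
            w *= Phi(a * t)
        z0 += w
        z1 += w * t
        z2 += w * t * t
    mean = z1 / z0
    return mean, z2 / z0 - mean * mean

a_list = [1.0, 2.0, -0.5, 1.5]          # hypothetical coefficients a_i
print(ep_probit(a_list))
print(posterior_by_quadrature(a_list))
```

A single sweep of the inner loop starting from zero-initialized terms corresponds to ADF; the later sweeps are what EP adds in the batch setting.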

4. A moment matching method

Weng and Lin [31] proposed a moment matching method based on a version of Stein's identity (Woodroofe [34]) and exact calculation of certain integrals. To distinguish this identity from Stein's lemma, they refer to it as Woodroofe-Stein's identity. Its idea is similar to that of Stein's lemma [27], but Stein considered expectations with respect to a normal distribution, whereas Woodroofe considered integration with respect to a “nearly normal distribution” dΓ(z) = f(z)ϕ_p(z) dz. Stein's lemma is famous and of interest because of its applications to the James-Stein estimator [14] and empirical Bayes methods. In contrast, Woodroofe-Stein's identity is little known.

Below we present this approximation method. Let ϕ_p(z) and Φ_p(z) denote the density and distribution function of a p-variate standard normal variate; let ϕ and Φ be abbreviations of ϕ_1 and Φ_1, respectively. Let ϕ(x|µ, σ) be the density of a normal distribution with mean µ and standard deviation σ.

Let Z = [Z_1, ..., Z_p]^T be a p-dimensional random vector with probability density

$$ C \phi_p(z) f(z), \tag{7} $$

where f : R^p → R is continuously differentiable and C is the normalizing constant. The following equations are simple consequences of Woodroofe-Stein's identity:

$$ \mathrm{E}[Z] = \mathrm{E}\left[ \frac{\nabla f(Z)}{f(Z)} \right], \tag{8} $$

$$ \mathrm{E}[Z_i Z_q] = \delta_{iq} + \mathrm{E}\left[ \frac{\nabla^2 f(Z)}{f(Z)} \right]_{iq}, \quad i, q = 1, \ldots, p, \tag{9} $$

where δ_iq = 1 if i = q and 0 otherwise, and [·]_iq indicates the (i, q) component of a matrix.

By regarding ϕ_p(z) in (7) as the prior density of z and Cf(z) as the likelihood based on new observations, (8) and (9) show how posterior moments are related to the likelihood. In many situations the likelihood has a product form f = ∏_{j=1}^m f_j. For such cases, (8) and (9) simplify to

$$ \mathrm{E}[Z_i] = \mathrm{E}\left[ \frac{\partial f_1(Z)/\partial Z_i}{f_1(Z)} + \cdots + \frac{\partial f_m(Z)/\partial Z_i}{f_m(Z)} \right], \tag{10} $$

$$ \mathrm{Var}[Z_i] = 1 + \mathrm{E}\left[ \frac{\partial}{\partial Z_i} \left( \frac{\partial f_1(Z)/\partial Z_i}{f_1(Z)} + \cdots + \frac{\partial f_m(Z)/\partial Z_i}{f_m(Z)} \right) \right]. \tag{11} $$

The rudiments of simple analytic approximations are obtained by setting Z on the right hand side of these equations to zero. Take (8) for illustration:

$$ \mathrm{E}[Z] \approx \mathrm{E}\left( \left[ \frac{\nabla f(Z)}{f(Z)} \right]_{Z=0} \right). \tag{12} $$

The reasoning behind this is that the normalized quantity Z may approximately follow a standard normal distribution in many applications.
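Identity (8) and the approximation (12) can be checked numerically in one dimension. The sketch below assumes a hypothetical likelihood factor f(z) = Φ(A z + B), with constants A and B chosen only for illustration, and evaluates the expectations under the density Cϕ(z)f(z) by a Riemann sum.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

A, B = 0.3, 0.2            # hypothetical constants in f(z) = Phi(A*z + B)

def f(z):
    return Phi(A * z + B)

def df(z):
    return A * phi(A * z + B)   # derivative of f

def expect(g, n=4001, lim=10.0):
    """E[g(Z)] under the density C * phi(z) * f(z), by a Riemann sum on [-lim, lim]."""
    h = 2.0 * lim / (n - 1)
    num = den = 0.0
    for k in range(n):
        z = -lim + k * h
        w = phi(z) * f(z)
        num += w * g(z)
        den += w
    return num / den

lhs = expect(lambda z: z)                 # E[Z]
rhs = expect(lambda z: df(z) / f(z))      # E[f'(Z) / f(Z)], right side of (8)
approx = df(0.0) / f(0.0)                 # right side of (12), Z set to zero
print(lhs, rhs, approx)
```

The first two numbers agree to numerical precision, while the third differs slightly, which shows the price of replacing the expectation by its value at Z = 0.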

Now consider a model for ranked data. Assume that the strength θ_i of player i follows N(µ_i, σ_i²), as in online rating systems such as Glicko [9] and TrueSkill™ [12]. Then, upon observing the game outcomes D, the skill can be updated as N(µ_i^new, (σ_i²)^new), where µ_i^new and (σ_i²)^new are the posterior mean and variance of θ_i derived from the posterior distribution of θ = (θ_1, ..., θ_p):

$$ p(\theta \mid D) = C \phi_p(\theta \mid \mu, \sigma) P(D), \tag{13} $$

where P(D) is the likelihood based on the game outcomes D. Some popular probability models for paired comparison data assume that there are unobserved actual performances X_i and X_j and that

$$ P(i \text{ beats } j) = P(X_i - X_j > 0). $$

Two commonly used distributions for X_i − X_j are the normal (corresponding to the Thurstone-Mosteller model [28]) and the logistic (corresponding to the Bradley-Terry model). Let Z_i = (θ_i − µ_i)/σ_i. From (13), it is not difficult to see that the posterior distribution of Z = (Z_1, ..., Z_p) is of the form (7).

In general there are no analytical forms for the posterior moments. Nevertheless, exact analytical forms exist in some cases outside the conjugate family. For instance, when P(D) in (13) follows the Thurstone-Mosteller model for paired comparison, with β_i² denoting the variance of the performance X_i about θ_i and the outcome being that player 1 beats player 2, exact calculation shows

$$ \mathrm{E}(\theta_1) = \mu_1 + \frac{\sigma_1^2}{\sqrt{\sum_{i=1}^{2} (\beta_i^2 + \sigma_i^2)}} \, \frac{\phi\left( \frac{\mu_1 - \mu_2}{\sqrt{\sum_{i=1}^{2} (\beta_i^2 + \sigma_i^2)}} \right)}{\Phi\left( \frac{\mu_1 - \mu_2}{\sqrt{\sum_{i=1}^{2} (\beta_i^2 + \sigma_i^2)}} \right)}. \tag{14} $$

The rudimentary analytic approximation in (12) (equivalent to replacing the unknown parameter θ on the right hand side of (8) with its current estimate µ) gives

$$ \mathrm{E}(\theta_1) \approx \mu_1 + \frac{\sigma_1^2}{\sqrt{\beta_1^2 + \beta_2^2}} \, \frac{\phi\left( \frac{\mu_1 - \mu_2}{\sqrt{\beta_1^2 + \beta_2^2}} \right)}{\Phi\left( \frac{\mu_1 - \mu_2}{\sqrt{\beta_1^2 + \beta_2^2}} \right)}. \tag{15} $$

In light of (14), Weng and Lin [31] proposed to improve (15) by suitable scaling. Furthermore, if f in (8) is the logistic distribution function, reasonable analytic approximations can be obtained by approximating the logistic distribution by a normal one (Aitchison and Begg [1], Glickman [8]). Together with (10) and (11), Weng and Lin [31] obtained simple analytic rules to update players' strengths from games that may involve multiple teams and multiple players. Their experiments show that the prediction accuracy of the proposed online algorithm is comparable to that of TrueSkill™, while both the running time and the code are much shorter.
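The difference between (14) and (15) can be seen in a small numerical sketch. It assumes a common performance variance β² for both players, that player 1 beats player 2, and rating values of the kind used as defaults in TrueSkill-style systems (mean 25, standard deviation 25/3, β = 25/6); these numbers and all function names are illustrative assumptions. A brute-force two-dimensional Riemann sum serves as a reference for the exact posterior mean.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exact_mean(mu1, mu2, var1, var2, beta2):
    """Equation (14): posterior mean of theta_1 after player 1 beats player 2."""
    c = math.sqrt(2.0 * beta2 + var1 + var2)
    d = (mu1 - mu2) / c
    return mu1 + var1 / c * phi(d) / Phi(d)

def approx_mean(mu1, mu2, var1, beta2):
    """Equation (15): the unscaled approximation, with theta replaced by mu in (8)."""
    c = math.sqrt(2.0 * beta2)
    d = (mu1 - mu2) / c
    return mu1 + var1 / c * phi(d) / Phi(d)

def mean_by_quadrature(mu1, mu2, var1, var2, beta2, n=201, width=7.0):
    """Reference value of E(theta_1 | 1 beats 2) from a grid over (theta_1, theta_2)."""
    s1, s2, b = math.sqrt(var1), math.sqrt(var2), math.sqrt(2.0 * beta2)
    num = den = 0.0
    for i in range(n):
        t1 = mu1 + s1 * width * (2.0 * i / (n - 1) - 1.0)
        w1 = phi((t1 - mu1) / s1)
        for j in range(n):
            t2 = mu2 + s2 * width * (2.0 * j / (n - 1) - 1.0)
            # Prior weight times the Thurstone-Mosteller win probability.
            w = w1 * phi((t2 - mu2) / s2) * Phi((t1 - t2) / b)
            num += w * t1
            den += w
    return num / den

var, beta2 = (25.0 / 3.0) ** 2, (25.0 / 6.0) ** 2
print(exact_mean(25.0, 25.0, var, var, beta2))
print(mean_by_quadrature(25.0, 25.0, var, var, beta2))
print(approx_mean(25.0, 25.0, var, beta2))
```

With these values the unscaled rule (15) moves the mean further than the exact formula (14), which is in line with the scaling correction mentioned above.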

This method has been used in other applications of ranked data. For example, Wistuba et al. [33] modified it for move prediction of computer Go and obtained promising results; Chen et al. [5] employed it to give an efficient online Bayesian ranking scheme in a crowdsourced setting. Recently Weng [30] applied it to obtain an online parameter estimation algorithm for an ordinal item response model, with an application for Internet ratings data.

5. Discussions

The methods of Laplace approximation, variational Bayes, and expectation propagation have been well developed. Comparisons of these methods, together with their strengths and weaknesses, are also available (Kuss and Rasmussen [17], Nickisch and Rasmussen [24], among others). In contrast, the moment matching method in Section 4 is still new, and there may be much room for improvement and continued development. For example, in addition to moment approximations, it would be interesting to extend this method to approximate the posterior distribution p(θ|D) and the model evidence p(D). Moreover, the current applications are for online learning, where equations (10) and (11) typically involve only a few terms (i.e. small m). The error can be greater when m is large, and this deserves further study.

We conclude by noting that no single method is best in all situations. Each method has its advantages and disadvantages, so one should select a technique appropriate for the problem at hand.

References

[1] J. Aitchison and C. B. Begg. Statistical diagnosis when basic cases are not classified with certainty. Biometrika, 63:1-12, 1976.

[2] H. Attias. A variational Bayesian framework for graphical models. In T. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 12, pages 209-215. MIT Press, Cambridge, MA, 2000.

[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[5] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In The Sixth ACM International Conference on Web Search and Data Mining, 2013, to appear.

[6] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197-208, 2000.

[7] C. Faes, J. T. Ormerod, and M. P. Wand. Variational Bayesian inference for parametric and nonparametric regression with missing data. J. Amer. Statist. Assoc., 106:959-971, 2011.

[8] M. E. Glickman. Paired comparison models with time-varying parameters. PhD thesis, Department of Statistics, Harvard University, 1993.

[9] M. E. Glickman. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48(3):377-394, 1999.

[10] P. Hall, T. Pham, M. P. Wand, and S. S. J. Wang. Asymptotic normality and valid inference for Gaussian variational approximation. Ann. Statist., 39(5):2502-2532, 2011.

[11] J. E. Handschin and D. Q. Mayne. Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. International Journal of Control, 9:547-559, 1969.

[12] R. Herbrich, T. Minka, and T. Graepel. TrueSkill™: A Bayesian skill rating system. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 569-576. MIT Press, Cambridge, MA, 2007.

[13] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Proceedings UAI-2002, pages 216-233, 2002.

[14] W. James and C. Stein. Estimation with quadratic loss. In Proc. Fourth Berkeley Symp. Math. Statist. Prob., volume 1, pages 361-379, 1961.

[15] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.

[16] R. E. Kass and A. E. Raftery. Bayes factors. J. Amer. Statist. Assoc., 90(430):773-795, 1995.

[17] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704, 2005.

[18] S. L. Lauritzen. Propagation of probabilities, means and variances in mixed graphical association models. J. Amer. Statist. Assoc., 87:1098-1108, 1992.

[19] J. S. Liu and R. Chen. Sequential Monte Carlo methods for dynamical systems.

[20] J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.

[21] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer, New York, third edition, 2008.

[22] P. S. Maybeck. Stochastic Models, Estimation, and Control. Academic Press, 1982.

[23] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.

[24] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035-2078, 2008.

[25] J. T. Ormerod and M. P. Wand. Gaussian variational approximate inference for generalized linear mixed models. Journal of Computational and Graphical Statistics, 21(1):2-17, 2012.

[26] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society, 71:319-392, 2009.

[27] C. Stein. Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9:1135-1151, 1981.

[28] L. L. Thurstone. A law of comparative judgement. Psychological Review, 34:273-286, 1927.

[29] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc., 81:82-86, 1986.

[30] R. C. Weng. Real-time Bayesian parameter estimation for item response models. Technical Report, National Chengchi University, 2013.

[31] R. C. Weng and C.-J. Lin. A Bayesian approximation method for online ranking.

[32] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

[33] M. Wistuba, L. Schaefers, and M. Platzner. Comparison of Bayesian move prediction systems for computer Go. In IEEE Conference on Computational Intelligence and Games, 2012.

[34] M. Woodroofe. Very weak expansions for sequentially designed experiments: linear models. Ann. Statist., 17:1087-1102, 1989.


Journal of the Chinese Statistical Association Vol. 52, (2014) 44–58

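Approximate Bayesian Inference in Machine Learning

Chiu-Hsing Weng

Department of Statistics, National Chengchi University

Abstract (translated from the Chinese)

Machine learning is the discipline concerned with designing algorithms that allow computers to automatically analyze data and learn its regularities. Many machine learning methods are built on probabilistic models and the framework of Bayesian statistics. Among approximate Bayesian inference methods, stochastic simulation such as Markov chain Monte Carlo has long been widely known and popular. Besides stochastic simulation, however, there are deterministic approximation methods that have been quite successful in many applications. In this article we introduce the concepts and developments of several deterministic approximation methods.

Key words: approximate Bayesian inference; machine learning. JEL classification: C110.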