Introduction to Bayesian Statistics
Lecture 13: Review
Rung-Ching Tsai
Department of Mathematics National Taiwan Normal University
June 3, 2015
Topics covered so far
• Frequentist vs. Bayesian statistics
• Single-parameter models
• Multiple-parameter models
• Monte Carlo approximation
• Regression
• Model checking
• Model comparison
• Bayesian tools: WinBUGS & JAGS
• Assessing convergence
Frequentist vs. Bayesian statistics
• Frequentist (classical) statistics
◦ In Frequentist statistics, parameters are fixed, and we think of the properties of estimation methods in repeated sampling, that is, when we imagine taking many random samples from the population that generated our observed data.
◦ It is not meaningful to talk about the probability that the parameter falls within a range, such as Prob(θ > 0), or Prob(θ ∈ [a, b]) = 0.95.
• Bayesian statistics
◦ Probability measures degree of uncertainty. Inference is conditional on the observed data.
◦ There is not much distinction between parameters and random variables. Thus, it is perfectly legitimate to talk about the probability that the parameter falls within a range, such as Prob(θ > 0), or Prob(θ ∈ [a, b]) = 0.95.
Basic Idea of Bayesian: From Prior to Posterior
• Goal: Estimate the parameter of interest θ or predict ỹ
• Three steps first (see the sketch after this list)
◦ specify the prior: p(θ)
◦ consider the likelihood of θ: l(θ|y) = p(y|θ)
◦ find the posterior distribution of θ:
p(θ|y) = p(θ, y)/p(y) = p(θ)p(y|θ)/p(y) ∝ p(θ)p(y|θ)
• Point estimation for θ
• Credible interval estimation for θ
• Predictive interval for ỹ
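A minimal sketch of the three steps for a conjugate case, assuming illustrative data (y = 7 successes in n = 10 trials) and a Beta(1, 1) prior; none of these numbers come from the slides:

```python
import numpy as np
from scipy import stats

# Hypothetical data: y successes in n Bernoulli trials
n, y = 10, 7

# Step 1: specify the prior p(theta) -- here Beta(a, b) with a = b = 1
a, b = 1.0, 1.0

# Step 2: the likelihood l(theta|y) = p(y|theta) is Binomial(n, theta)

# Step 3: posterior p(theta|y) is proportional to p(theta) * p(y|theta);
# by conjugacy this is Beta(a + y, b + n - y) in closed form
posterior = stats.beta(a + y, b + n - y)
print("posterior mean:", posterior.mean())
```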
Point estimation
• Idea: A Bayes estimator θ̂ minimizes the posterior expected loss for a specific loss function (a sketch computing each estimate from posterior draws follows this list).
◦ EAP (expected a posteriori) with the quadratic loss function l_2(θ̂, θ) = (θ̂ − θ)²:
θ̂_EAP = E(θ|y) = ∫ θ p(θ|y) dθ.
◦ MAP (maximum a posteriori) with the zero-one loss function l_0(θ̂, θ) = 0 if |θ̂ − θ| ≤ ε, and l_0(θ̂, θ) = 1 if |θ̂ − θ| > ε:
θ̂_MAP = Mode of p(θ|y) = arg max_θ p(θ|y).
◦ The median with the linear loss function l_1(θ̂, θ) = |θ̂ − θ|:
θ̂ = Med(θ|y), i.e., ∫_{−∞}^{θ̂} p(θ|y) dθ = 0.5.
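A sketch computing the three point estimates from posterior draws; the Beta(8, 4) posterior is an illustrative assumption, and the MAP is approximated here by maximizing a kernel density estimate of the draws:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
posterior = stats.beta(8, 4)              # assumed posterior, for illustration
draws = posterior.rvs(10000, random_state=rng)

theta_eap = draws.mean()                  # EAP: posterior mean (quadratic loss)
theta_med = np.median(draws)              # posterior median (linear loss)

# MAP: mode of p(theta|y); approximated by maximizing a kernel
# density estimate of the draws over a grid
grid = np.linspace(0.01, 0.99, 500)
kde = stats.gaussian_kde(draws)
theta_map = grid[np.argmax(kde(grid))]

print(theta_eap, theta_med, theta_map)
```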
Interval estimation
• A credible interval [a, b] at level (1 − α) is defined by
∫_a^b p(θ|y) dθ = 1 − α,
where a, b ∈ R and p(θ|y) is the posterior distribution of θ (a sketch computing both interval types from posterior draws follows this list).
◦ Quantile-based intervals: Typically the respective quantiles of p(θ|y ) are used as endpoints. For instance, its 2.5% and 97.5% quantiles are used to construct a 95% credible interval for θ.
◦ Highest posterior density (HPD) interval: The minimum density of any point within that region is equal to or larger than the density of any point outside that region.
p(θ|y) ≥ p(θ̃|y), ∀ θ ∈ I = [a, b] ⊂ Θ and θ̃ ∉ I.
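A sketch computing both interval types from posterior draws, assuming the same illustrative Beta(8, 4) posterior; the HPD interval is found as the shortest interval containing 95% of the sorted draws (valid for a unimodal posterior):

```python
import numpy as np
from scipy import stats

draws = np.sort(stats.beta(8, 4).rvs(10000, random_state=1))
alpha = 0.05

# Quantile-based 95% credible interval: 2.5% and 97.5% quantiles
q_lo, q_hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])

# HPD interval: among all intervals containing 95% of the sorted
# draws, take the shortest one
m = int(np.floor((1 - alpha) * len(draws)))
widths = draws[m:] - draws[:-m]
i = np.argmin(widths)
h_lo, h_hi = draws[i], draws[i + m]

print((q_lo, q_hi), (h_lo, h_hi))
```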
Predictive interval for ỹ
• After obtaining the posterior distribution of θ, p(θ|y), we can compute the posterior predictive distribution of a future observation ỹ as
p(ỹ|y) = ∫_Θ p(ỹ, θ|y) dθ = ∫_Θ p(ỹ|θ, y) p(θ|y) dθ = ∫_Θ p(ỹ|θ) p(θ|y) dθ.
• A 100(1 − α)% posterior predictive interval [c, d] for ỹ is similarly defined by
∫_c^d p(ỹ|y) dỹ = 1 − α, where c, d ∈ R (see the sketch below).
• Note that the prior predictive distribution of ỹ is
p(ỹ) = ∫_Θ p(ỹ, θ) dθ = ∫_Θ p(ỹ|θ) p(θ) dθ.
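A sketch of simulating from p(ỹ|y): draw θ^(s) from the posterior, then ỹ^(s) from p(ỹ|θ^(s)); the same illustrative Beta(8, 4) posterior and a Binomial sampling model are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_new = 10                                # size of a hypothetical future experiment

# draw theta^(s) ~ p(theta|y), then y_tilde^(s) ~ p(y_tilde|theta^(s))
theta = stats.beta(8, 4).rvs(10000, random_state=rng)
y_tilde = rng.binomial(n_new, theta)

# 95% posterior predictive interval: 2.5% and 97.5% quantiles of y_tilde
c, d = np.quantile(y_tilde, [0.025, 0.975])
print(c, d)
```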
Choices of Prior
• Conjugate prior: If F is a class of sampling distributions p(y|θ), then the class P of prior distributions is conjugate for F if
p(θ|y) ∈ P for all p(·|θ) ∈ F and p(θ) ∈ P.
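◦ For example, the Beta family is conjugate for the Binomial likelihood: if θ ∼ Beta(a, b) and y|θ ∼ Binomial(n, θ), then θ|y ∼ Beta(a + y, b + n − y).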
• Non-informative prior
◦ uniform prior distributions on different scales
• p(θ) = 1 on the original scale; or p(logit θ) ∝ 1, which corresponds to p(θ) ∝ θ⁻¹(1 − θ)⁻¹
• p(σ) ∝ 1; or p(log σ²) ∝ 1, which corresponds to p(σ²) ∝ σ⁻²
◦ invariant to parameterization (Jeffreys' principle): p(θ) ∝ [J(θ)]^(1/2), where J(θ) is the Fisher information for θ:
J(θ) = E[(d log p(y|θ)/dθ)² | θ] = −E[d² log p(y|θ)/dθ² | θ]
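◦ For example, for y ∼ Binomial(n, θ) the Fisher information is J(θ) = n/[θ(1 − θ)], so Jeffreys' prior is p(θ) ∝ θ^(−1/2)(1 − θ)^(−1/2), a Beta(1/2, 1/2) distribution.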
Models
• Single-parameter models
◦ discrete: X ∼ Poisson(λ), X ∼ Binomial(n, p), etc.
◦ continuous: X ∼ Normal(µ, σ²) with σ² known, X ∼ Exponential(λ), etc.
• Multiparameter models
◦ discrete: X_1, · · · , X_q ∼ Multinomial(n; p_1, · · · , p_q)
◦ continuous: X ∼ N(µ, σ²) with both µ and σ² unknown
• Regression models
◦ Y discrete: logistic regression
◦ Y continuous: multiple regression, ANOVA, ANCOVA
Monte Carlo Approximation
• We are often interested in summarizing not only the posterior distribution of the parameter(s) but also other posterior quantities. For example:
1. Pr(θ ∈ A|y_1, · · · , y_n) for arbitrary sets A.
2. the posterior distribution of |θ_1 − θ_2|.
3. the posterior distribution of max{θ_1, · · · , θ_m}.
• Obtaining exact values for these posterior quantities can be difficult or impossible.
• Instead, we can generate random sample values of the parameters from their posterior distributions, then all of these posterior
quantities of interest can be approximated to an arbitrary degree of precision using the Monte Carlo method.
Monte Carlo approximation
Once we have iid samples θ^(1), · · · , θ^(S) from p(θ|y_1, · · · , y_n):
• The empirical distribution of {θ^(1), · · · , θ^(S)} is known as a Monte Carlo approximation to p(θ|y_1, · · · , y_n).
• θ̄ = Σ_{s=1}^S θ^(s)/S → E[θ|y_1, · · · , y_n];
• Σ_{s=1}^S (θ^(s) − θ̄)²/(S − 1) → Var[θ|y_1, · · · , y_n];
• #(θ^(s) ≤ c)/S → Pr(θ ≤ c|y_1, · · · , y_n);
• the median of {θ^(1), · · · , θ^(S)} → θ_{1/2};
• the α-percentile of {θ^(1), · · · , θ^(S)} → θ_α.
• For any function g(θ),
(1/S) Σ_{s=1}^S g(θ^(s)) → E[g(θ)|y_1, · · · , y_n] = ∫ g(θ) p(θ|y_1, · · · , y_n) dθ as S → ∞.
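A sketch approximating the derived quantities listed above (the posterior of |θ_1 − θ_2| and of max{θ_1, θ_2}) by transforming posterior draws; the two independent Beta posteriors are illustrative assumptions:

```python
import numpy as np
from scipy import stats

S = 100000
theta1 = stats.beta(58, 44).rvs(S, random_state=1)   # assumed posterior of theta_1
theta2 = stats.beta(31, 21).rvs(S, random_state=2)   # assumed posterior of theta_2

# posterior distribution of |theta_1 - theta_2|: transform the joint draws
diff = np.abs(theta1 - theta2)
print(diff.mean(), np.quantile(diff, [0.025, 0.975]))

# posterior distribution of max{theta_1, theta_2}
mx = np.maximum(theta1, theta2)
print(mx.mean(), np.quantile(mx, [0.025, 0.975]))
```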
Simulations instead of Integration
• Successive conditional simulations
◦ Draw θ_2 from its marginal posterior distribution, p(θ_2|y).
◦ Draw θ_1 from its conditional posterior distribution given the drawn value of θ_2, p(θ_1|θ_2, y).
• All-others conditional simulations (Gibbs sampler)
◦ Draw θ_1^(t+1) from the conditional posterior distribution given the previously drawn θ_2^(t), i.e., from p(θ_1|θ_2^(t), y).
◦ Draw θ_2^(t+1) from the conditional posterior distribution given the newly drawn θ_1^(t+1), i.e., from p(θ_2|θ_1^(t+1), y).
◦ Iterating this procedure ultimately generates samples from the joint posterior distribution p(θ_1, θ_2|y).
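A minimal Gibbs sampler sketch for an assumed toy target: a bivariate normal for (θ_1, θ_2) with correlation ρ, where both full conditionals are known normals; this target is an illustration, not a model from the lectures:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 5000
theta1, theta2 = 0.0, 0.0                 # arbitrary starting values
draws = np.empty((T, 2))

for t in range(T):
    # draw theta_1^(t+1) | theta_2^(t) ~ N(rho * theta_2, 1 - rho^2)
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))
    # draw theta_2^(t+1) | theta_1^(t+1) ~ N(rho * theta_1, 1 - rho^2)
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))
    draws[t] = theta1, theta2

print(np.corrcoef(draws[1000:].T))        # close to rho after burn-in
```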
Model checking
• Idea: check “ALL” aspects of the “model”
• Why? Prior-to-posterior inference involves the whole structure of the Bayesian model (including any hierarchies) and can produce spurious inferences if the model is poor. Aspects to check:
◦ sensitivity analysis with respect to the prior distribution
◦ the appropriateness of the likelihood model (i.e., the sampling distribution)
◦ any hierarchical structure
◦ other issues, such as which explanatory variables or covariates should have been included in the model
◦ any particular feature of the data one wishes to capture
Posterior predictive p-value
• Let y = (y_1, y_2, · · · , y_n)′ be the observed data and θ be the set of all parameters (including all hyperparameters) of a model with posterior
p(θ|y) ∝ p(θ) × p(y|θ).
• Let y_rep = (y_rep,1, y_rep,2, · · · , y_rep,n)′ be the replicated data that we would see if the experiment were replicated with the same model and the same value of θ that produced the observed data y.
• Replicated data y_rep, like predictions ỹ, carry two components of uncertainty:
p(y_rep|y) = ∫ p(y_rep|θ) p(θ|y) dθ
◦ the fundamental variability of the model, represented by the posited variability in the data
◦ the posterior uncertainty in the estimation of θ
Posterior predictive p-value
• Test quantities T (y, θ)
◦ measure the discrepancy between model and data in the aspects of the data one wishes to check
◦ Test quantities play the role in Bayesian model checking that test statistics T(y) play in classical testing. In classical statistics, the test statistic T(y) does not depend on model parameters.
• Tail-area probability
◦ Lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability, or p-value, of the test quantity.
◦ It is commonly computed using posterior simulations of (θ, yrep).
Posterior predictive p-values
• Classical p-value (for a test statistic T(y)):
p_C = Pr(T(y_rep) ≥ T(y) | θ)
◦ the probability is taken over the distribution of y_rep with θ fixed.
◦ the test statistic T(y) does not depend on model parameters.
• Posterior predictive p-value (for a test quantity T(y, θ)):
p_B = Pr(T(y_rep, θ) ≥ T(y, θ) | y)
= ∫∫ I{T(y_rep, θ) ≥ T(y, θ)} p(y_rep|θ) p(θ|y) dy_rep dθ
◦ test quantities can be a function of the parameters and the data because the test quantity is evaluated over draws from the posterior distribution of the unknown θ and y_rep.
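A sketch of computing p_B by simulation, assuming an illustrative Poisson model with a conjugate Gamma posterior and T(y) = max(y) as the test quantity:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.array([2, 1, 0, 4, 3, 1, 2, 9])    # hypothetical observed counts

# assumed Gamma(a + sum(y), rate b + n) posterior for a Poisson rate
# under a Gamma(a, b) prior (a = b = 1 here, for illustration)
a, b, n, S = 1.0, 1.0, len(y), 10000
theta = rng.gamma(a + y.sum(), 1.0 / (b + n), size=S)

# for each posterior draw, simulate a replicated data set and compare
# T(y_rep) = max(y_rep) with T(y) = max(y)
y_rep = rng.poisson(theta[:, None], size=(S, n))
p_B = np.mean(y_rep.max(axis=1) >= y.max())
print(p_B)
```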
Conditional Predictive Ordinate (CPO)
• Problem: Although the full predictive distribution p(y_rep|y) is useful for prediction, its use for model checking is questionable because of the double use of the data, which causes predictive performance to be overestimated.
• Idea: Use the leave-one-out cross-validation predictive density instead (Geisser and Eddy, 1979). This is also known as the Conditional Predictive Ordinate, or CPO (Gelfand, 1996):
CPO_i = p(y_i|y_[i]) = ∫ p(y_i|θ) p(θ|y_[i]) dθ,
where y_i is the i-th observation and y_[i] is y with observation i removed.
• The CPO is a handy posterior predictive check because it may be used to identify outliers and influential observations, and for hypothesis testing across different non-nested models.
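CPO_i can be estimated from full-posterior draws without refitting the model, using the identity CPO_i = [E(1/p(y_i|θ) | y)]^(−1) (Gelfand, 1996). A sketch for the same illustrative Poisson example as above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = np.array([2, 1, 0, 4, 3, 1, 2, 9])    # hypothetical observed counts
theta = rng.gamma(1.0 + y.sum(), 1.0 / (1.0 + len(y)), size=10000)

# CPO_i = 1 / mean_s( 1 / p(y_i | theta^(s)) ): harmonic mean of the
# pointwise likelihoods over posterior draws
lik = stats.poisson.pmf(y[None, :], theta[:, None])   # shape (S, n)
cpo = 1.0 / np.mean(1.0 / lik, axis=0)
print(cpo)                                # unusually small CPO_i flags outliers
```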
Model comparison
• Model comparison based on predictive performance
◦ Deviance information criterion (DIC)
◦ Watanabe-Akaike information criterion (WAIC)
• Model comparison based on the Bayes factor
B[H2 : H1] = p(y|H2) / p(y|H1),
where
p(y|H) = ∫ p(y|θ, H) p(θ|H) dθ
and
p(H2|y) / p(H1|y) = [p(H2) / p(H1)] × [p(y|H2) / p(y|H1)].
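A sketch of a Bayes factor in a case where both marginal likelihoods have closed forms: H1: θ = 1/2 versus H2: θ ∼ Uniform(0, 1) for Binomial data; the counts are illustrative assumptions:

```python
from math import comb

n, y = 20, 15                             # hypothetical Binomial data

# H1: theta = 1/2 exactly -- the marginal likelihood is just the likelihood
p_y_H1 = comb(n, y) * 0.5**n

# H2: theta ~ Uniform(0, 1) -- integrating the Binomial likelihood over
# the prior gives p(y|H2) = 1 / (n + 1) in closed form
p_y_H2 = 1.0 / (n + 1)

B21 = p_y_H2 / p_y_H1                     # Bayes factor B[H2 : H1]
print(B21)
```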
Bayesian tools: WinBUGS & JAGS
• WinBUGS
• JAGS
http://www.johnmyleswhite.com/notebook/2010/08/20/using-jags-in-r-with-the-rjags-package/
Assessing convergence
• Run multiple chains with starting points dispersed throughout the sample space.
• Monitor the convergence of all quantities of interest by inspecting their traces and comparing variation between and within simulated sequences until these are almost equal. For example, use the potential scale reduction factor R̂: at convergence, R̂ = 1 for all quantities of interest, so stop the simulations when the values are all near 1 (say, below 1.1). A sketch of computing R̂ appears after this list.
• Discard a burn-in period (and/or thin the sequences) prior to inference.
◦ “burn-in”: discarding early draws so that the retained draws reflect convergence
◦ “thinning”: keeping every k-th draw to reduce dependence among the retained posterior draws
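A sketch of computing R̂ for one scalar quantity from m chains of length n, following the between/within-chain variance formula; the simulated chains here stand in for real MCMC output:

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction factor for an (m, n) array of draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)       # between-chain variance
    W = chains.var(axis=1, ddof=1).mean() # within-chain variance
    var_plus = (n - 1) / n * W + B / n    # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
chains = rng.normal(0, 1, size=(4, 2000)) # stand-in for 4 converged chains
print(rhat(chains))                       # near 1 indicates convergence
```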