Introduction to Bayesian Statistics
Lecture 13: Review
Rung-Ching Tsai
Department of Mathematics National Taiwan Normal University
June 3, 2015
Topics covered so far
• Frequentist vs. Bayesian statistics
• Single-parameter models
• Multiple-parameter models
• Monte Carlo approximation
• Regression
• Model checking
• Model comparison
• Bayesian tools: WinBUGS & JAGS
• Assessing convergence
Frequentist vs. Bayesian statistics
• Frequentist (classical) statistics
◦ In Frequentist statistics, parameters are fixed, and we think of the properties of estimation methods in repeated sampling, that is, when we imagine taking many random samples from the population that generated our observed data.
◦ It is not meaningful to talk about the probability that the parameter falls within a range, such as Prob(θ > 0), or Prob(θ ∈ [a, b]) = 0.95.
• Bayesian statistics
◦ Probability measures degree of uncertainty. Inference is conditional on the observed data.
◦ There is not much distinction between parameters and random variables. Thus, it is perfectly legitimate to talk about the probability that the parameter falls within a range, such as Prob(θ > 0), or Prob(θ ∈ [a, b]) = 0.95.
Basic Idea of Bayesian: From Prior to Posterior
• Goal: Estimate the parameter of interest θ or predict ỹ
• Three steps first (see the sketch after this list)
◦ specify the prior: p(θ)
◦ consider the likelihood of θ: l(θ|y) = p(y|θ)
◦ find the posterior distribution of θ:
p(θ|y) = p(θ, y)/p(y) = p(θ)p(y|θ)/p(y) ∝ p(θ)p(y|θ)
• Point estimation for θ
• Credible interval estimation for θ
• Predictive interval for ỹ
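A minimal sketch of the three steps for a conjugate case, assuming illustrative data (y = 7 successes in n = 10 trials) and a Beta(1, 1) prior; none of these numbers come from the slides:

```python
import numpy as np
from scipy import stats

# Hypothetical data: y successes in n Bernoulli trials
n, y = 10, 7

# Step 1: specify the prior p(theta) -- here Beta(a, b) with a = b = 1
a, b = 1.0, 1.0

# Step 2: the likelihood l(theta|y) = p(y|theta) is Binomial(n, theta)

# Step 3: posterior p(theta|y) is proportional to p(theta) * p(y|theta);
# by conjugacy this is Beta(a + y, b + n - y) in closed form
posterior = stats.beta(a + y, b + n - y)
print("posterior mean:", posterior.mean())
```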
Point estimation
• Idea: A Bayes estimator θ̂ minimizes the posterior expected loss for a specific loss function (a sketch computing each estimate from posterior draws follows this list).
◦ EAP (expected a posteriori) with the quadratic loss function l_2(θ̂, θ) = (θ̂ − θ)²:
θ̂_EAP = E(θ|y) = ∫ θ p(θ|y) dθ.
◦ MAP (maximum a posteriori) with the zero-one loss function l_0(θ̂, θ) = 0 if |θ̂ − θ| ≤ ε, and l_0(θ̂, θ) = 1 if |θ̂ − θ| > ε:
θ̂_MAP = Mode of p(θ|y) = arg max_θ p(θ|y).
◦ The median with the linear loss function l_1(θ̂, θ) = |θ̂ − θ|:
θ̂ = Med(θ|y), i.e., ∫_{−∞}^{θ̂} p(θ|y) dθ = 0.5.
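A sketch computing the three point estimates from posterior draws; the Beta(8, 4) posterior is an illustrative assumption, and the MAP is approximated here by maximizing a kernel density estimate of the draws:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
posterior = stats.beta(8, 4)              # assumed posterior, for illustration
draws = posterior.rvs(10000, random_state=rng)

theta_eap = draws.mean()                  # EAP: posterior mean (quadratic loss)
theta_med = np.median(draws)              # posterior median (linear loss)

# MAP: mode of p(theta|y); approximated by maximizing a kernel
# density estimate of the draws over a grid
grid = np.linspace(0.01, 0.99, 500)
kde = stats.gaussian_kde(draws)
theta_map = grid[np.argmax(kde(grid))]

print(theta_eap, theta_med, theta_map)
```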
Interval estimation
• A credible interval [a, b] at level (1 − α) is defined by
∫_a^b p(θ|y) dθ = 1 − α,
where a, b ∈ R and p(θ|y) is the posterior distribution of θ (a sketch computing both interval types from posterior draws follows this list).
◦ Quantile-based intervals: Typically the respective quantiles of p(θ|y ) are used as endpoints. For instance, its 2.5% and 97.5% quantiles are used to construct a 95% credible interval for θ.
◦ Highest posterior density (HPD) interval: The minimum density of any point within that region is equal to or larger than the density of any point outside that region.
p(θ|y) ≥ p(θ̃|y), ∀ θ ∈ I = [a, b] ⊂ Θ and θ̃ ∉ I.
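A sketch computing both interval types from posterior draws, assuming the same illustrative Beta(8, 4) posterior; the HPD interval is found as the shortest interval containing 95% of the sorted draws (valid for a unimodal posterior):

```python
import numpy as np
from scipy import stats

draws = np.sort(stats.beta(8, 4).rvs(10000, random_state=1))
alpha = 0.05

# Quantile-based 95% credible interval: 2.5% and 97.5% quantiles
q_lo, q_hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])

# HPD interval: among all intervals containing 95% of the sorted
# draws, take the shortest one
m = int(np.floor((1 - alpha) * len(draws)))
widths = draws[m:] - draws[:-m]
i = np.argmin(widths)
h_lo, h_hi = draws[i], draws[i + m]

print((q_lo, q_hi), (h_lo, h_hi))
```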
Predictive interval for ỹ
• After obtaining the posterior distribution of θ, p(θ|y), we can compute the posterior predictive distribution of a future observation ỹ as
p(ỹ|y) = ∫_Θ p(ỹ, θ|y) dθ = ∫_Θ p(ỹ|θ, y) p(θ|y) dθ = ∫_Θ p(ỹ|θ) p(θ|y) dθ.
• A 100(1 − α)% posterior predictive interval [c, d] for ỹ is similarly defined by
∫_c^d p(ỹ|y) dỹ = 1 − α, where c, d ∈ R (see the sketch below).
• Note that the prior predictive distribution of ỹ is
p(ỹ) = ∫_Θ p(ỹ, θ) dθ = ∫_Θ p(ỹ|θ) p(θ) dθ.
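A sketch of simulating from p(ỹ|y): draw θ^(s) from the posterior, then ỹ^(s) from p(ỹ|θ^(s)); the same illustrative Beta(8, 4) posterior and a Binomial sampling model are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_new = 10                                # size of a hypothetical future experiment

# draw theta^(s) ~ p(theta|y), then y_tilde^(s) ~ p(y_tilde|theta^(s))
theta = stats.beta(8, 4).rvs(10000, random_state=rng)
y_tilde = rng.binomial(n_new, theta)

# 95% posterior predictive interval: 2.5% and 97.5% quantiles of y_tilde
c, d = np.quantile(y_tilde, [0.025, 0.975])
print(c, d)
```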
Choices of Prior
• Conjugate prior: If F is a class of sampling distributions p(y|θ), then the class P of prior distributions is conjugate for F if
p(θ|y) ∈ P for all p(·|θ) ∈ F and p(θ) ∈ P.
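◦ For example, the Beta family is conjugate for the Binomial likelihood: if θ ∼ Beta(a, b) and y|θ ∼ Binomial(n, θ), then θ|y ∼ Beta(a + y, b + n − y).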
• Non-informative prior
◦ uniform prior distributions on different scales
• p(θ) = 1 on the original scale; or p(logit θ) ∝ 1, which corresponds to p(θ) ∝ θ⁻¹(1 − θ)⁻¹
• p(σ) ∝ 1; or p(log σ²) ∝ 1, which corresponds to p(σ²) ∝ σ⁻²
◦ invariant to parameterization (Jeffreys' principle): p(θ) ∝ [J(θ)]^(1/2), where J(θ) is the Fisher information for θ:
J(θ) = E[(d log p(y|θ)/dθ)² | θ] = −E[d² log p(y|θ)/dθ² | θ]
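◦ For example, for y ∼ Binomial(n, θ) the Fisher information is J(θ) = n/[θ(1 − θ)], so Jeffreys' prior is p(θ) ∝ θ^(−1/2)(1 − θ)^(−1/2), a Beta(1/2, 1/2) distribution.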
Models
• Single-parameter models
◦ discrete: X ∼ Poisson(λ), X ∼ Binomial(n, p), etc.
◦ continuous: X ∼ Normal(µ, σ²) with σ² known, X ∼ Exponential(λ), etc.
• Multiparameter models
◦ discrete: X_1, · · · , X_q ∼ Multinomial(n; p_1, · · · , p_q)
◦ continuous: X ∼ N(µ, σ²) with both µ and σ² unknown
• Regression models
◦ Y discrete: logistic regression
◦ Y continuous: multiple regression, ANOVA, ANCOVA
Monte Carlo Approximation
• We are often interested in summarizing not only the posterior distribution of the parameter(s) but also other posterior quantities. For example:
1. Pr(θ ∈ A|y_1, · · · , y_n) for arbitrary sets A.
2. the posterior distribution of |θ_1 − θ_2|.
3. the posterior distribution of max{θ_1, · · · , θ_m}.
• Obtaining exact values for these posterior quantities can be difficult or impossible.
• Instead, we can generate random sample values of the parameters from their posterior distributions, then all of these posterior
quantities of interest can be approximated to an arbitrary degree of precision using the Monte Carlo method.
Monte Carlo approximation
Once we have iid samples θ^(1), · · · , θ^(S) from p(θ|y_1, · · · , y_n):
• The empirical distribution of {θ^(1), · · · , θ^(S)} is known as a Monte Carlo approximation to p(θ|y_1, · · · , y_n).
• θ̄ = Σ_{s=1}^S θ^(s)/S → E[θ|y_1, · · · , y_n];
• Σ_{s=1}^S (θ^(s) − θ̄)²/(S − 1) → Var[θ|y_1, · · · , y_n];
• #(θ^(s) ≤ c)/S → Pr(θ ≤ c|y_1, · · · , y_n);
• the median of {θ^(1), · · · , θ^(S)} → θ_{1/2};
• the α-percentile of {θ^(1), · · · , θ^(S)} → θ_α.
• For any function g(θ),
(1/S) Σ_{s=1}^S g(θ^(s)) → E[g(θ)|y_1, · · · , y_n] = ∫ g(θ) p(θ|y_1, · · · , y_n) dθ as S → ∞.
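A sketch approximating the derived quantities listed above (the posterior of |θ_1 − θ_2| and of max{θ_1, θ_2}) by transforming posterior draws; the two independent Beta posteriors are illustrative assumptions:

```python
import numpy as np
from scipy import stats

S = 100000
theta1 = stats.beta(58, 44).rvs(S, random_state=1)   # assumed posterior of theta_1
theta2 = stats.beta(31, 21).rvs(S, random_state=2)   # assumed posterior of theta_2

# posterior distribution of |theta_1 - theta_2|: transform the joint draws
diff = np.abs(theta1 - theta2)
print(diff.mean(), np.quantile(diff, [0.025, 0.975]))

# posterior distribution of max{theta_1, theta_2}
mx = np.maximum(theta1, theta2)
print(mx.mean(), np.quantile(mx, [0.025, 0.975]))
```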
Simulations instead of Integration
• Successive conditional simulations
◦ Draw θ_2 from its marginal posterior distribution, p(θ_2|y).
◦ Draw θ_1 from its conditional posterior distribution given the drawn value of θ_2, p(θ_1|θ_2, y).
• All-others conditional simulations (Gibbs sampler)
◦ Draw θ_1^(t+1) from the conditional posterior distribution given the previously drawn θ_2^(t), i.e., from p(θ_1|θ_2^(t), y).
◦ Draw θ_2^(t+1) from the conditional posterior distribution given the newly drawn θ_1^(t+1), i.e., from p(θ_2|θ_1^(t+1), y).
◦ Iterating this procedure ultimately generates samples from the joint posterior distribution p(θ_1, θ_2|y).
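A minimal Gibbs sampler sketch for an assumed toy target: a bivariate normal for (θ_1, θ_2) with correlation ρ, where both full conditionals are known normals; this target is an illustration, not a model from the lectures:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 5000
theta1, theta2 = 0.0, 0.0                 # arbitrary starting values
draws = np.empty((T, 2))

for t in range(T):
    # draw theta_1^(t+1) | theta_2^(t) ~ N(rho * theta_2, 1 - rho^2)
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))
    # draw theta_2^(t+1) | theta_1^(t+1) ~ N(rho * theta_1, 1 - rho^2)
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))
    draws[t] = theta1, theta2

print(np.corrcoef(draws[1000:].T))        # close to rho after burn-in
```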
Model checking
• Idea: check “ALL” aspects of the “model”
• Why? Prior-to-posterior inference involves the whole structure of the Bayesian model (including any hierarchies) and can produce spurious inferences if the model is poor. Aspects to check:
◦ sensitivity analysis with respect to the prior distribution
◦ the appropriateness of the likelihood model (i.e., the sampling distribution)
◦ any hierarchical structure
◦ other issues, such as which explanatory variables or covariates should have been included in the model
◦ any particular feature of the data one wishes to capture
Posterior predictive p-value
• Let y = (y_1, y_2, · · · , y_n)′ be the observed data and θ be the set of all parameters (including all hyperparameters) of a model with posterior
p(θ|y) ∝ p(θ) × p(y|θ).
• Let y_rep = (y_rep,1, y_rep,2, · · · , y_rep,n)′ be the replicated data that we would see if the experiment were replicated with the same model and the same value of θ that produced the observed data y.
• Replicated data y_rep, like predictions ỹ, carry two components of uncertainty:
p(y_rep|y) = ∫ p(y_rep|θ) p(θ|y) dθ
◦ the fundamental variability of the model, represented by the posited variability in the data
◦ the posterior uncertainty in the estimation of θ
Posterior predictive p-value
• Test quantities T (y, θ)
◦ measure the discrepancy between model and data in the aspects of the data one wishes to check
◦ Test quantities play the role in Bayesian model checking that test statistics T(y) play in classical testing. In classical statistics, the test statistic T(y) does not depend on model parameters.
• Tail-area probability
◦ Lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability, or p-value, of the test quantity.
◦ It is commonly computed using posterior simulations of (θ, yrep).
Posterior predictive p-values
• Classical p-value (for a test statistic T(y)):
p_C = Pr(T(y_rep) ≥ T(y) | θ)
◦ the probability is taken over the distribution of y_rep with θ fixed.
◦ the test statistic T(y) does not depend on model parameters.
• Posterior predictive p-value (for a test quantity T(y, θ)):
p_B = Pr(T(y_rep, θ) ≥ T(y, θ) | y)
= ∫∫ I{T(y_rep, θ) ≥ T(y, θ)} p(y_rep|θ) p(θ|y) dy_rep dθ
◦ test quantities can be a function of the parameters and the data because the test quantity is evaluated over draws from the posterior distribution of the unknown θ and y_rep.
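A sketch of computing p_B by simulation, assuming an illustrative Poisson model with a conjugate Gamma posterior and T(y) = max(y) as the test quantity:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.array([2, 1, 0, 4, 3, 1, 2, 9])    # hypothetical observed counts

# assumed Gamma(a + sum(y), rate b + n) posterior for a Poisson rate
# under a Gamma(a, b) prior (a = b = 1 here, for illustration)
a, b, n, S = 1.0, 1.0, len(y), 10000
theta = rng.gamma(a + y.sum(), 1.0 / (b + n), size=S)

# for each posterior draw, simulate a replicated data set and compare
# T(y_rep) = max(y_rep) with T(y) = max(y)
y_rep = rng.poisson(theta[:, None], size=(S, n))
p_B = np.mean(y_rep.max(axis=1) >= y.max())
print(p_B)
```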
Conditional Predictive Ordinate (CPO)
• Problem: Although the full predictive distribution p(y_rep|y) is useful for prediction, its use for model checking is questionable because of the double use of the data, which causes predictive performance to be overestimated.
• Idea: Use the leave-one-out cross-validation predictive density instead (Geisser and Eddy, 1979). This is also known as the Conditional Predictive Ordinate, or CPO (Gelfand, 1996):
CPO_i = p(y_i|y_[i]) = ∫ p(y_i|θ) p(θ|y_[i]) dθ,
where y_i is the i-th observation and y_[i] is y with observation i removed.
• The CPO is a handy posterior predictive check because it may be used to identify outliers and influential observations, and for hypothesis testing across different non-nested models.
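CPO_i can be estimated from full-posterior draws without refitting the model, using the identity CPO_i = [E(1/p(y_i|θ) | y)]^(−1) (Gelfand, 1996). A sketch for the same illustrative Poisson example as above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = np.array([2, 1, 0, 4, 3, 1, 2, 9])    # hypothetical observed counts
theta = rng.gamma(1.0 + y.sum(), 1.0 / (1.0 + len(y)), size=10000)

# CPO_i = 1 / mean_s( 1 / p(y_i | theta^(s)) ): harmonic mean of the
# pointwise likelihoods over posterior draws
lik = stats.poisson.pmf(y[None, :], theta[:, None])   # shape (S, n)
cpo = 1.0 / np.mean(1.0 / lik, axis=0)
print(cpo)                                # unusually small CPO_i flags outliers
```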
Model comparison
• Model comparison based on predictive performance
◦ Deviance information criterion (DIC)
◦ Watanabe-Akaike information criterion (WAIC)
• Model comparison based on the Bayes factor
B[H2 : H1] = p(y|H2) / p(y|H1),
where
p(y|H) = ∫ p(y|θ, H) p(θ|H) dθ
and
p(H2|y) / p(H1|y) = [p(H2) / p(H1)] × [p(y|H2) / p(y|H1)].
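A sketch of a Bayes factor in a case where both marginal likelihoods have closed forms: H1: θ = 1/2 versus H2: θ ∼ Uniform(0, 1) for Binomial data; the counts are illustrative assumptions:

```python
from math import comb

n, y = 20, 15                             # hypothetical Binomial data

# H1: theta = 1/2 exactly -- the marginal likelihood is just the likelihood
p_y_H1 = comb(n, y) * 0.5**n

# H2: theta ~ Uniform(0, 1) -- integrating the Binomial likelihood over
# the prior gives p(y|H2) = 1 / (n + 1) in closed form
p_y_H2 = 1.0 / (n + 1)

B21 = p_y_H2 / p_y_H1                     # Bayes factor B[H2 : H1]
print(B21)
```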
Bayesian tools: WinBUGS & JAGS
• WinBUGS
• JAGS
http://www.johnmyleswhite.com/notebook/2010/08/20/using-jags-in-r-with-the-rjags-package/
Assessing convergence
• Run multiple chains with starting points dispersed throughout the sample space.
• Monitor the convergence of all quantities of interest by inspecting their traces and comparing variation between and within simulated sequences until these are almost equal. For example, use the potential scale reduction factor R̂: at convergence, R̂ = 1 for all quantities of interest, so stop the simulations when the values are all near 1 (say, below 1.1). A sketch of computing R̂ appears after this list.
• Discard a burn-in period (and/or thin the sequences) prior to inference.
◦ “burn-in”: discarding early draws so that the retained draws reflect convergence
◦ “thinning”: keeping every k-th draw to reduce dependence among the retained posterior draws
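A sketch of computing R̂ for one scalar quantity from m chains of length n, following the between/within-chain variance formula; the simulated chains here stand in for real MCMC output:

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction factor for an (m, n) array of draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)       # between-chain variance
    W = chains.var(axis=1, ddof=1).mean() # within-chain variance
    var_plus = (n - 1) / n * W + B / n    # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
chains = rng.normal(0, 1, size=(4, 2000)) # stand-in for 4 converged chains
print(rhat(chains))                       # near 1 indicates convergence
```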