(1)

Introduction to Bayesian Statistics

Lecture 13: Review

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

June 3, 2015

(2)

Topics covered so far

Frequentist vs. Bayesian statistics

Single-parameter models

Multiple-parameter models

Monte Carlo approximation

Regression

Model checking

Model comparison

Bayesian tools: WinBUGS & JAGS

Assessing convergence

(3)

Frequentist vs. Bayesian statistics

Frequentist (classical) statistics

In Frequentist statistics, parameters are fixed, and we think of properties of estimation methods in repeated sampling, that is, when we imagine taking many random samples from the population that generated our observed data.

It is not meaningful to talk about the probability that the parameter falls within a range, such as Prob(θ > 0), or Prob(θ ∈ [a, b]) = 0.95.

Bayesian statistics

Probability measures degree of uncertainty. Inference is conditional on the observed data.

There is not much distinction between parameters and random variables. Thus, it is perfectly legitimate to talk about the probability that the parameter falls within a range, such as Prob(θ > 0), or Prob(θ ∈ [a, b]) = 0.95.


(4)

Basic Idea of Bayesian: From Prior to Posterior

Goal: Estimate the parameter of interest θ or predict a future observation ỹ

Three steps come first:

specify the prior: p(θ)

consider the likelihood of θ: l(θ|y) = p(y|θ)

find the posterior distribution of θ:

p(θ|y) = p(θ, y)/p(y) = p(θ)p(y|θ)/p(y) ∝ p(θ)p(y|θ)

Point estimation for θ

Credible interval estimation for θ

Predictive interval for ỹ
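As a minimal numerical sketch of these three steps (my illustration, not from the lecture; the data values are assumed), take a binomial likelihood with a conjugate Beta prior, so the posterior is available in closed form:

# Prior-to-posterior sketch for a binomial success probability theta.
# Assumed setup: theta ~ Beta(1, 1) prior; y successes in n trials.
# Conjugacy gives the posterior Beta(a + y, b + n - y) directly.
from scipy import stats

a, b = 1.0, 1.0                      # Beta(1, 1) prior, uniform on [0, 1]
n, y = 20, 7                         # assumed data: 7 successes in 20 trials
posterior = stats.beta(a + y, b + n - y)

print(posterior.mean())              # a point estimate of theta (posterior mean)
print(posterior.interval(0.95))      # a 95% central credible interval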

(5)

Point estimation

Idea: A Bayes estimator θ̂ minimizes the expected posterior loss of an estimate under a specific loss function.

EAP (expected a posteriori), with the quadratic loss function l2(θ̂, θ) = (θ̂ − θ)²:

θ̂_EAP = E(θ|y) = ∫ θ p(θ|y) dθ.

MAP (maximum a posteriori), with the zero-one loss function l3(θ̂, θ) = 0 if |θ̂ − θ| ≤ ε and 1 if |θ̂ − θ| > ε:

θ̂_MAP = mode of p(θ|y) = arg max_θ p(θ|y).

The median, with the linear loss function l1(θ̂, θ) = |θ̂ − θ|:

θ̂ = Med(θ|y): ∫_{−∞}^{θ̂} p(θ|y) dθ = 0.5.
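A small grid sketch (illustrative; the Beta(8, 14) posterior is an assumption of mine) showing how the three estimators are computed and can differ:

# Compare EAP, MAP, and posterior median for an assumed Beta(8, 14) posterior.
import numpy as np
from scipy import stats

post = stats.beta(8, 14)                     # assumed posterior p(theta | y)
grid = np.linspace(1e-6, 1 - 1e-6, 10_000)
dens = post.pdf(grid)

eap = np.sum(grid * dens) / np.sum(dens)     # posterior mean (quadratic loss)
mode = grid[np.argmax(dens)]                 # posterior mode (zero-one loss)
med = post.ppf(0.5)                          # posterior median (linear loss)
print(eap, mode, med)                        # roughly 0.364, 0.350, 0.360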


(6)

Interval estimation

A credible interval [a, b] at level (1 − α) is defined by

∫_a^b p(θ|y) dθ = 1 − α,

where a, b ∈ ℝ and p(θ|y) is the posterior distribution of θ.

Quantile-based intervals: Typically the respective quantiles of p(θ|y ) are used as endpoints. For instance, its 2.5% and 97.5% quantiles are used to construct a 95% credible interval for θ.

Highest posterior density (HPD) interval: the density of any point within the region is equal to or larger than the density of any point outside it:

p(θ|y) ≥ p(θ̃|y) for all θ ∈ I = [a, b] ⊂ Θ and θ̃ ∉ I.
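A sketch of both constructions from posterior draws (a skewed Gamma stand-in posterior is assumed, so the two intervals visibly differ):

# Quantile-based and HPD 95% intervals from posterior draws.
import numpy as np

rng = np.random.default_rng(0)
draws = np.sort(rng.gamma(3.0, 1.0, size=50_000))   # assumed posterior draws

# Quantile-based interval: the 2.5% and 97.5% sample quantiles.
q_lo, q_hi = np.quantile(draws, [0.025, 0.975])

# HPD interval: among all intervals covering 95% of the sorted draws,
# take the shortest (for a unimodal posterior this is the HPD interval).
m = int(np.floor(0.95 * len(draws)))
i = np.argmin(draws[m:] - draws[:-m])
print((q_lo, q_hi), (draws[i], draws[i + m]))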

(7)

Predictive interval for ỹ

After obtaining the posterior distribution of θ, p(θ|y), we can compute the posterior predictive distribution of a future observation ỹ as

p(ỹ|y) = ∫_Θ p(ỹ, θ|y) dθ = ∫_Θ p(ỹ|θ, y) p(θ|y) dθ = ∫_Θ p(ỹ|θ) p(θ|y) dθ.

A 100(1 − α)% posterior predictive interval [c, d] for ỹ is similarly defined by

∫_c^d p(ỹ|y) dỹ = 1 − α, where c, d ∈ ℝ.

Note that the prior predictive distribution of ỹ is

p(ỹ) = ∫_Θ p(ỹ, θ) dθ = ∫_Θ p(ỹ|θ) p(θ) dθ.
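A composition-sampling sketch (continuing the Beta posterior assumed earlier; the future sample size is also assumed): draw θ from p(θ|y), then ỹ from p(ỹ|θ), and read the interval off the simulated ỹ values:

# Posterior predictive interval for a future binomial count y_tilde.
import numpy as np

rng = np.random.default_rng(1)
a_post, b_post = 8.0, 14.0          # assumed Beta posterior for theta
n_new = 20                          # assumed size of the future experiment

theta = rng.beta(a_post, b_post, size=100_000)   # theta ~ p(theta | y)
y_tilde = rng.binomial(n_new, theta)             # y_tilde ~ p(y_tilde | theta)

print(np.quantile(y_tilde, [0.025, 0.975]))      # 95% predictive interval [c, d]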


(8)

Choices of Prior

Conjugate prior: if F is a class of sampling distributions p(y|θ), then the class P of priors is conjugate for F if

p(θ|y) ∈ P for all p(·|θ) ∈ F and p(θ) ∈ P.

Non-informative prior

uniform prior distributions on different scales:

p(θ) ∝ 1; or p(logit θ) ∝ 1, which corresponds to p(θ) ∝ θ⁻¹(1 − θ)⁻¹

p(σ) ∝ 1; or p(log σ²) ∝ 1, which corresponds to p(σ²) ∝ σ⁻²

invariance to reparameterization: Jeffreys' principle takes p(θ) ∝ [J(θ)]^{1/2}, where J(θ) is the Fisher information for θ:

J(θ) = E[(d log p(y|θ)/dθ)²] = −E[d² log p(y|θ)/dθ²]
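A standard worked example (added here for concreteness, not on the original slide): for y ∼ Binomial(n, θ),

log p(y|θ) = y log θ + (n − y) log(1 − θ) + constant,

d² log p(y|θ)/dθ² = −y/θ² − (n − y)/(1 − θ)²,

so, using E[y|θ] = nθ,

J(θ) = nθ/θ² + n(1 − θ)/(1 − θ)² = n/[θ(1 − θ)],

and Jeffreys' prior is p(θ) ∝ [J(θ)]^{1/2} ∝ θ^{−1/2}(1 − θ)^{−1/2}, i.e., a Beta(1/2, 1/2) distribution.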



(9)

Models

Single-parameter models

discrete: X ∼ Poisson(λ), X ∼ Binomial(n, p), etc.

continuous: X ∼ Normal(µ, σ²) with σ² known, X ∼ Exponential(λ), etc.

Multiparameter models

discrete: (X1, · · · , Xq) ∼ Multinomial(n; p1, · · · , pq)

continuous: X ∼ N(µ, σ²) with both µ and σ² unknown

Regression models

Y discrete: logistic regression

Y continuous: multiple regression, ANOVA, ANCOVA


(10)

Monte Carlo Approximation

We are often interested in summarizing not only the posterior distribution of the parameter(s) itself, but also other aspects of that posterior distribution. For example:

1. Pr(θ ∈ A|y1, · · · , yn) for arbitrary sets A.

2. the posterior distribution of |θ1 − θ2|

3. the posterior distribution of max{θ1, · · · , θm}

Obtaining exact values for these posterior quantities can be difficult or impossible.

Instead, we can generate random samples of the parameters from their posterior distributions; all of these posterior quantities of interest can then be approximated to an arbitrary degree of precision using the Monte Carlo method.

(11)

Monte Carlo approximation

Once we have θ(1), · · · , θ(S), i.i.d. samples from p(θ|y1, · · · , yn):

The empirical distribution of {θ(1), · · · , θ(S)} is known as a Monte Carlo approximation to p(θ|y1, · · · , yn).

θ̄ = (1/S) Σ_{s=1}^{S} θ(s) → E[θ|y1, · · · , yn];

Σ_{s=1}^{S} (θ(s) − θ̄)²/(S − 1) → Var[θ|y1, · · · , yn];

#(θ(s) ≤ c)/S → Pr(θ ≤ c|y1, · · · , yn);

the median of {θ(1), · · · , θ(S)} → θ_{1/2};

the α-percentile of {θ(1), · · · , θ(S)} → θ_α.

For any function g(θ),

(1/S) Σ_{s=1}^{S} g(θ(s)) → E[g(θ)|y1, · · · , yn] = ∫ g(θ) p(θ|y1, · · · , yn) dθ as S → ∞.
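A sketch of these approximations (the Beta(8, 14) posterior is again assumed as a stand-in, so draws can be generated directly):

# Monte Carlo approximation of posterior summaries from S draws.
import numpy as np

rng = np.random.default_rng(2)
theta = rng.beta(8, 14, size=100_000)        # theta^(1), ..., theta^(S)

print(theta.mean())                          # -> E[theta | y]
print(theta.var(ddof=1))                     # -> Var[theta | y]
print(np.mean(theta <= 0.5))                 # -> Pr(theta <= 0.5 | y)
print(np.median(theta))                      # -> posterior median theta_{1/2}
print(np.quantile(theta, 0.05))              # -> 5th posterior percentile
print(np.mean(np.log(theta / (1 - theta))))  # -> E[g(theta) | y] for g = logit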


(12)

Simulations instead of Integration

Successive conditional simulations

Draw θ2 from its marginal posterior distribution, p(θ2|y).

Draw θ1 from its conditional posterior distribution given the drawn value of θ2, p(θ1|θ2, y).

All-others conditional simulations (Gibbs sampler)

Draw θ1(t+1) from its conditional posterior distribution given the previously drawn value θ2(t), i.e., from p(θ1|θ2(t), y).

Draw θ2(t+1) from its conditional posterior distribution given the value θ1(t+1) just drawn, i.e., from p(θ2|θ1(t+1), y).

Iterating this procedure ultimately generates samples from the joint posterior distribution p(θ1, θ2|y); a sketch follows.
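A minimal Gibbs sketch for an assumed bivariate normal posterior with unit variances and correlation ρ, where each full conditional is univariate normal:

# Gibbs sampler for (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]]) (assumed target).
import numpy as np

rng = np.random.default_rng(3)
rho, T = 0.8, 10_000
cond_sd = np.sqrt(1 - rho**2)        # sd of each full conditional
draws = np.empty((T, 2))
th1 = th2 = 0.0                      # arbitrary starting values

for t in range(T):
    th1 = rng.normal(rho * th2, cond_sd)   # theta1 | theta2, y
    th2 = rng.normal(rho * th1, cond_sd)   # theta2 | theta1 (just drawn), y
    draws[t] = th1, th2

print(np.corrcoef(draws[1000:].T))   # after burn-in, correlation is near rho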

(13)

Model checking

Idea: check “ALL” aspects of the “model”

Why? Prior-to-posterior inference involves the whole structure (including any hierarchies) of the Bayesian model, and it can produce spurious inferences if the model is poor. Check:

sensitivity analysis with respect to the prior distribution

the appropriateness of the likelihood model (i.e., the sampling distribution)

any hierarchical structure

other issues, such as which explanatory variables or covariates should have been included in the model

any particular feature of the data one wishes to capture


(14)

Posterior predictive p-value

Let y = (y1, y2, · · · , yn)′ be the observed data and θ the set of all parameters (including all hyperparameters) for a model

p(θ|y) ∝ p(θ) × p(y|θ).

Let yrep = (yrep,1, yrep,2, · · · , yrep,n)′ be the replicated data that we would see if the experiment were replicated with the same model and the same value of θ that produced the observed data y.

Replicated data yrep, like the prediction ỹ, has two components of uncertainty:

p(yrep|y) = ∫ p(yrep|θ) p(θ|y) dθ

the fundamental variability of the model, represented by the posited variability in the data

the posterior uncertainty in the estimation of θ

(15)

Posterior predictive p-value

Test quantities T (y, θ)

measure the discrepancy between model and data in the aspects of the data one wishes to check

Test quantities play the role in Bayesian model checking that test statistics T(y) play in classical testing. In classical statistics, the test statistic T(y) does not depend on model parameters.

Tail-area probability

Lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability, or p-value, of the test quantity.

It is commonly computed using posterior simulations of (θ, yrep).

15 of 20

(16)

Posterior predictive p-values

Classical p-values (for a test statistic T(y)):

pC = Pr(T(yrep) ≥ T(y)|θ),

where the probability is taken over the distribution of yrep with θ fixed;

the test statistic T(y) does not depend on model parameters.

Posterior predictive p-values (for a test quantity T(y, θ)):

pB = Pr(T(yrep, θ) ≥ T(y, θ)|y) = ∫∫ I_{T(yrep,θ) ≥ T(y,θ)} p(yrep|θ) p(θ|y) dyrep dθ

Test quantities can be functions of both the parameters and the data because the test quantity is evaluated over draws of the unknown θ and yrep from their posterior distribution.
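A simulation sketch under an assumed conjugate model (yi ∼ N(θ, 1) with a N(0, 100) prior on θ, and stand-in data), using the test statistic T(y) = max(y):

# Posterior predictive p-value via simulation of (theta, y_rep).
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(0.0, 1.0, size=30)            # stand-in observed data
n = len(y)

prec = 1 / 100 + n                           # posterior precision of theta
mu, sd = y.sum() / prec, np.sqrt(1 / prec)   # conjugate normal posterior

S = 20_000
theta = rng.normal(mu, sd, size=S)                     # theta ~ p(theta | y)
y_rep = rng.normal(theta[:, None], 1.0, size=(S, n))   # y_rep ~ p(y_rep | theta)

p_B = np.mean(y_rep.max(axis=1) >= y.max())  # tail-area probability
print(p_B)                                   # values near 0 or 1 signal misfit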

(17)

Conditional Predictive Ordinate (CPO)

Problem: Although the full predictive distribution p(yrep|y) is useful for prediction, its use for model checking is questionable because of the double use of the data, which causes predictive performance to be overestimated.

Idea: Use the leave-one-out cross-validation predictive density, proposed by Geisser and Eddy (1979) and also known as the conditional predictive ordinate, or CPO (Gelfand, 1996).

CPOi = p(yi|y[i]) = ∫ p(yi|θ) p(θ|y[i]) dθ,

where yi is the i-th observed value and y[i] is y without observation i.

The CPO is a handy posterior predictive check because it can be used to identify outliers and influential observations, and for hypothesis testing across different non-nested models.
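Since p(θ|y) = p(yi|θ) p(θ|y[i]) / p(yi|y[i]) for conditionally independent data, CPOi can be estimated from full-data posterior draws as the harmonic mean of the likelihood terms, CPOi ≈ [S⁻¹ Σs 1/p(yi|θ(s))]⁻¹. A sketch under the same assumed normal model as above:

# CPO_i from full-data posterior draws (harmonic-mean identity).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=30)            # stand-in observed data
n = len(y)

prec = 1 / 100 + n                           # same assumed model: y_i ~ N(theta, 1)
theta = rng.normal(y.sum() / prec, np.sqrt(1 / prec), size=20_000)

lik = stats.norm.pdf(y[None, :], loc=theta[:, None])   # p(y_i | theta^(s)), S x n
cpo = 1.0 / np.mean(1.0 / lik, axis=0)
print(cpo.round(3))      # unusually small CPO_i flags observation i as an outlier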


(18)

Model comparison

Model comparison based on predictive performance

Deviance information criterion (DIC)

Watanabe-Akaike information criterion (WAIC)

Model comparison based on the Bayes factor:

B[H2 : H1] = p(y|H2) / p(y|H1),

where

p(y|H) = ∫ p(y|θ, H) p(θ|H) dθ,

and

p(H2|y) / p(H1|y) = [p(H2) / p(H1)] × [p(y|H2) / p(y|H1)].
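A closed-form sketch (the two hypotheses are assumptions of mine, not from the slides): for y successes in n binomial trials with θ ∼ Beta(a, b) under a hypothesis H, the marginal likelihood is p(y|H) = C(n, y) B(y + a, n − y + b) / B(a, b), so a Bayes factor can be computed directly:

# Bayes factor for two assumed priors on a binomial proportion.
import numpy as np
from scipy.special import betaln, comb

def log_marginal(y, n, a, b):
    # log p(y | H) with theta ~ Beta(a, b) integrated out analytically
    return np.log(comb(n, y)) + betaln(y + a, n - y + b) - betaln(a, b)

y, n = 7, 20                                 # assumed data
log_B = log_marginal(y, n, 2, 2) - log_marginal(y, n, 1, 1)
print(np.exp(log_B))                         # B[H2 : H1], H2: Beta(2,2) vs H1: Beta(1,1)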

(19)

Bayesian tools: WinBUGS & JAGS

WinBUGS

JAGS

http://www.johnmyleswhite.com/notebook/2010/08/20/using-jags-in-r-with-the-rjags-package/


(20)

Assessing convergence

Run multiple chains with starting points dispersed throughout the sample space.

Monitor the convergence of all quantities of interest by inspecting their traces and comparing variation between and within simulated sequences until these are almost equal. For example, use the potential scale reduction factor R̂: at convergence, R̂ = 1 for all quantities of interest, so stop the simulations when all values are near 1 (say, below 1.1). A sketch of the computation follows this list.

Discard a burn-in portion of (and/or thin) the simulation sequences prior to inference.

"burn-in": discarding early iterations to make sure the chain has converged

"thinning": reducing dependence among the simulated posterior draws
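A sketch of the classic between/within computation of R̂ for m chains of length n (the simplest form, without the split-chain refinement):

# Potential scale reduction factor R-hat for one scalar quantity.
import numpy as np

def rhat(chains):
    # chains: array of shape (m, n), one row per simulated sequence
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    var_plus = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(6)
print(rhat(rng.normal(size=(4, 1000))))          # well-mixed chains: near 1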
