
Introduction to Bayesian Statistics Lecture 4: Multiparameter models (I)


Academic year: 2022


(1)

Introduction to Bayesian Statistics

Lecture 4: Multiparameter models (I)

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

March 18, 2015

(2)

Noninformative prior distributions

Proper and improper prior distributions

Unnormalized densities

Uniform prior distributions on different scales

Some examples

Probability parameter θ ∈ (0, 1)

One possibility: p(θ) = 1 [proper]

Another possibility: $p(\operatorname{logit}\theta) \propto 1$ corresponds to $p(\theta) \propto \theta^{-1}(1-\theta)^{-1}$ [improper]

Location parameter θ unconstrained

One possibility: $p(\theta) \propto 1$ [improper] ⇒ $p(\theta \mid y) \approx \text{normal}(\theta \mid \bar{y}, \sigma^2/n)$

Scale parameter σ > 0

One possibility: p(σ) ∝ 1 [improper]

Another possibility: $p(\log\sigma^2) \propto 1$ corresponds to $p(\sigma^2) \propto \sigma^{-2}$ [improper]

(3)

Noninformative prior distributions: Jeffreys' principle

$\phi = h(\theta)$, $p(\phi) = p(\theta)\left|\frac{d\theta}{d\phi}\right| = p(\theta)\,|h'(\theta)|^{-1}$

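Jeffreys' principle leads to a noninformative prior density:

$p(\theta) \propto [J(\theta)]^{1/2}$, where $J(\theta)$ is the Fisher information for $\theta$:

$$J(\theta) = E\left[\left(\frac{d\log p(y\mid\theta)}{d\theta}\right)^2\right] = -E\left[\frac{d^2\log p(y\mid\theta)}{d\theta^2}\right]$$
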

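Jeffreys' prior model is invariant to parameterization. Evaluate $J(\phi)$ at $\theta = h^{-1}(\phi)$:

$$J(\phi) = -E\left[\frac{d^2\log p(y\mid\phi)}{d\phi^2}\right] = -E\left[\frac{d^2\log p(y\mid\theta = h^{-1}(\phi))}{d\theta^2}\left|\frac{d\theta}{d\phi}\right|^2\right] = J(\theta)\left|\frac{d\theta}{d\phi}\right|^2;$$

thus, $J(\phi)^{1/2} = J(\theta)^{1/2}\left|\frac{d\theta}{d\phi}\right|$.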


(4)

Examples: Various noninformative prior distributions

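$y \mid \theta \sim \text{binomial}(n, \theta)$, $p(y\mid\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y}$

Jeffreys' prior density $p(\theta) \propto [J(\theta)]^{1/2}$:

$\log p(y\mid\theta) = \text{constant} + y\log\theta + (n-y)\log(1-\theta)$.

$$J(\theta) = -E\left[\frac{d^2\log p(y\mid\theta)}{d\theta^2}\right] = \frac{n}{\theta(1-\theta)}$$

Jeffreys' prior ⇒ $p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$.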

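Three alternative priors

Jeffreys' prior: $\theta \sim \text{Beta}(\tfrac{1}{2}, \tfrac{1}{2})$

uniform prior: $\theta \sim \text{Beta}(1, 1)$, i.e., $p(\theta) = 1$

improper prior: $\theta \sim \text{Beta}(0, 0)$, i.e., $p(\operatorname{logit}\theta) \propto 1$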
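The binomial Fisher information and the invariance property can be checked numerically. The sketch below is only an illustration: the values of `n` and `theta` are arbitrary, and the function names are made up for this example. It sums the squared score exactly over y = 0, ..., n, once in the θ parameterization and once in φ = logit θ, where the score is y − nθ.

```python
import math

def binomial_pmf(y, n, theta):
    """Binomial probability p(y | theta)."""
    return math.comb(n, y) * theta**y * (1 - theta)**(n - y)

def fisher_info_theta(n, theta):
    """J(theta) = E[(d log p / d theta)^2], summed exactly over y = 0..n."""
    total = 0.0
    for y in range(n + 1):
        score = y / theta - (n - y) / (1 - theta)
        total += binomial_pmf(y, n, theta) * score**2
    return total

def fisher_info_logit(n, theta):
    """J(phi) for phi = logit(theta); the score in phi is y - n*theta."""
    return sum(binomial_pmf(y, n, theta) * (y - n * theta)**2
               for y in range(n + 1))

n, theta = 12, 0.3                      # arbitrary illustrative values
j_theta = fisher_info_theta(n, theta)
j_phi = fisher_info_logit(n, theta)
dtheta_dphi = theta * (1 - theta)       # derivative of the inverse logit

print(j_theta, n / (theta * (1 - theta)))   # the two numbers should agree
print(j_phi, j_theta * dtheta_dphi**2)      # invariance: J(phi) = J(theta) (dtheta/dphi)^2
```

Taking square roots of the second line gives the change-of-variables rule for the Jeffreys' prior density itself.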

(5)

From single-parameter to multiparameter models

The reality of applied statistics: there are always several (maybe many) unknown parameters!

However, interest usually lies in only a few of these (the parameters of interest). The others are regarded as nuisance parameters: we have no interest in making inferences about them, but they are required in order to construct a realistic model.

At this point the simple conceptual framework of the Bayesian approach reveals its principal advantage over other forms of inference.


(6)

Bayesian approach to multiparameter models

The Bayesian approach is clear: Obtain the joint posterior distribution of all unknowns, then integrate over the nuisance parameters to leave the marginal posterior distribution for the parameters of interest.

Alternatively, using simulation: draw samples from the entire joint posterior distribution (even this may be computationally difficult), look at the parameters of interest, and ignore the rest.

(7)

Parameter of interest and nuisance parameter

Suppose model parameter θ has two parts θ = (θ1, θ2)

Parameter of interest: θ1

Nuisance parameter: θ2

For example

$y \mid \mu, \sigma^2 \sim \text{normal}(\mu, \sigma^2)$,

Unknown: $\mu$ and $\sigma^2$

Parameter of interest (usually, not always): $\mu$

Nuisance parameter: $\sigma^2$

Approach to obtain p(θ1|y )

Averaging over nuisance parameters

Factoring the joint posterior

A strategy for computation: Conditional simulation via Gibbs sampler


(8)

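Posterior distribution of $\theta = (\theta_1, \theta_2)$

Prior of $\theta$:

$p(\theta) = p(\theta_1, \theta_2)$

Likelihood of $\theta$:

$p(y \mid \theta) = p(y \mid \theta_1, \theta_2)$

Posterior of $\theta = (\theta_1, \theta_2)$ given $y$:

$p(\theta_1, \theta_2 \mid y) \propto p(\theta_1, \theta_2)\,p(y \mid \theta_1, \theta_2)$.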

(9)

Approaches to obtain marginal posterior of $\theta_1$, $p(\theta_1 \mid y)$

Joint posterior of $\theta_1$ and $\theta_2$: $p(\theta_1, \theta_2 \mid y) \propto p(\theta_1, \theta_2)\,p(y \mid \theta_1, \theta_2)$

Approaches to obtain marginal posterior density $p(\theta_1 \mid y)$

By averaging or integrating over the nuisance parameter $\theta_2$:

$$p(\theta_1 \mid y) = \int p(\theta_1, \theta_2 \mid y)\,d\theta_2.$$

By factoring the joint posterior:

$$p(\theta_1 \mid y) = \int p(\theta_1, \theta_2 \mid y)\,d\theta_2 = \int p(\theta_1 \mid \theta_2, y)\,p(\theta_2 \mid y)\,d\theta_2. \qquad (1)$$

$p(\theta_1 \mid y)$ is a mixture of the conditional posterior distributions given the nuisance parameter $\theta_2$, $p(\theta_1 \mid \theta_2, y)$.

The weighting function $p(\theta_2 \mid y)$ combines evidence from data and prior.

$\theta_2$ can be categorical (discrete) and may take only a few possible values representing, for example, different sub-models.
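When θ2 is discrete, the integral in (1) becomes a weighted sum. The following sketch illustrates that case with entirely hypothetical numbers: two sub-models, assumed posterior weights p(θ2|y), and assumed normal conditional posteriors p(θ1|θ2, y). None of these values come from the lecture.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of normal(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical posterior weights p(theta2 | y) for two sub-models (sum to 1).
weights = {"model_a": 0.3, "model_b": 0.7}
# Hypothetical (mean, sd) of the conditional posterior p(theta1 | theta2, y).
conditionals = {"model_a": (0.0, 1.0), "model_b": (2.0, 0.5)}

def marginal_pdf(theta1):
    """p(theta1 | y) = sum over theta2 of p(theta1 | theta2, y) p(theta2 | y)."""
    return sum(w * normal_pdf(theta1, *conditionals[k]) for k, w in weights.items())

# The marginal posterior mean is the weighted average of the conditional means.
marginal_mean = sum(w * conditionals[k][0] for k, w in weights.items())
print(marginal_pdf(1.0), marginal_mean)
```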


(10)

A strategy for computation: Simulations instead of integration

We rarely evaluate integral (1) explicitly, but it suggests an important strategy for constructing and computing with multiparameter models, using simulations.

Successive conditional simulations

Draw $\theta_2$ from its marginal posterior distribution, $p(\theta_2 \mid y)$.

Draw $\theta_1$ from its conditional posterior distribution given the drawn value of $\theta_2$, $p(\theta_1 \mid \theta_2, y)$.

All-others conditional simulations (Gibbs sampler)

Draw $\theta_1^{(t+1)}$ from its conditional posterior distribution given the previously drawn value $\theta_2^{(t)}$, $p(\theta_1 \mid \theta_2^{(t)}, y)$.

Draw $\theta_2^{(t+1)}$ from its conditional posterior distribution given the newly drawn value $\theta_1^{(t+1)}$, $p(\theta_2 \mid \theta_1^{(t+1)}, y)$.
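A minimal sketch of the Gibbs sampler follows. The target is an assumption chosen for illustration, not a model from the lecture: a bivariate normal "posterior" with zero means, unit variances, and correlation `rho`, for which each conditional is normal with mean `rho` times the other component and variance 1 − `rho`². The function names and the burn-in length are likewise illustrative.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_draws, seed=0, start=(0.0, 0.0)):
    """Alternate draws from p(theta1 | theta2) and p(theta2 | theta1)."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho**2)
    theta1, theta2 = start
    draws = []
    for _ in range(n_draws):
        theta1 = rng.gauss(rho * theta2, cond_sd)  # conditions on theta2 from iteration t
        theta2 = rng.gauss(rho * theta1, cond_sd)  # conditions on the new theta1
        draws.append((theta1, theta2))
    return draws

def correlation(pairs):
    """Sample correlation of a list of (a, b) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    v2 = sum((b - m2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(v1 * v2)

draws = gibbs_bivariate_normal(rho=0.8, n_draws=20000)
print(correlation(draws[1000:]))  # should be close to 0.8 after discarding early draws
```

Successive draws are correlated, so more iterations are needed than with independent sampling to reach the same accuracy.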

(11)

Multiparameter model: the normal model (I)

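$y_1, \dots, y_n \overset{\text{iid}}{\sim} \text{normal}(\mu, \sigma^2)$, both $\mu$ and $\sigma^2$ unknown; use the Bayesian approach to estimate $\mu$.

Choose a prior for $(\mu, \sigma^2)$; take noninformative priors:

$p(\mu, \sigma^2) = p(\mu)\,p(\sigma^2) \propto 1 \cdot (\sigma^2)^{-1} = \sigma^{-2}$

prior independence of location and scale

$p(\mu) \propto 1$: noninformative or uniform but improper prior

$p(\log\sigma^2) \propto 1 \Rightarrow p(\sigma^2) \propto (\sigma^2)^{-1}$: noninformative or uniform on $\log\sigma^2$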

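likelihood:

$$p(y \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right) \propto \sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right)$$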


(12)

Joint posterior distribution, $p(\mu, \sigma^2 \mid y)$

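$y_1, \dots, y_n \overset{\text{iid}}{\sim} \text{normal}(\mu, \sigma^2)$

prior of $(\mu, \sigma^2)$: $p(\mu, \sigma^2) = p(\mu)\,p(\sigma^2) \propto 1 \cdot (\sigma^2)^{-1} = \sigma^{-2}$

find the joint posterior distribution of $(\mu, \sigma^2)$: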

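$$\begin{aligned}
p(\mu, \sigma^2 \mid y) &\propto p(\mu, \sigma^2)\,p(y \mid \mu, \sigma^2) \\
&\propto \sigma^{-n-2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right) \\
&= \sigma^{-n-2}\exp\left(-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}(y_i - \bar{y})^2 + n(\bar{y} - \mu)^2\right]\right) \\
&= \sigma^{-n-2}\exp\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right),
\end{aligned}$$

where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ is the sample variance. The sufficient statistics are $\bar{y}$ and $s^2$.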
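The step from the sum of squares about µ to the form involving s² rests on the identity Σ(y_i − µ)² = (n − 1)s² + n(ȳ − µ)². A quick numerical check, using simulated data and an arbitrary trial value of µ (both purely illustrative), might look like this:

```python
import random

rng = random.Random(1)
y = [rng.gauss(5.0, 2.0) for _ in range(25)]  # hypothetical data
mu = 4.2                                      # arbitrary trial value of mu

n = len(y)
ybar = sum(y) / n
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)  # sample variance

lhs = sum((yi - mu) ** 2 for yi in y)             # sum of squares about mu
rhs = (n - 1) * s2 + n * (ybar - mu) ** 2         # decomposition via ybar and s2
print(lhs, rhs)  # equal up to floating-point rounding
```

Because the data enter the exponent only through ȳ and s², two data sets with the same n, ȳ, and s² give the same joint posterior.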

(13)

Conditional posterior distribution, $p(\mu \mid \sigma^2, y)$

$p(\mu, \sigma^2 \mid y) = p(\mu \mid \sigma^2, y)\,p(\sigma^2 \mid y)$

Using the single-parameter result for $\mu$ with known $\sigma^2$ and noninformative prior $p(\mu) \propto 1$, we have

$\mu \mid \sigma^2, y \sim \text{normal}(\bar{y}, \sigma^2/n)$.


(14)

Marginal posterior distribution, $p(\sigma^2 \mid y)$

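$p(\mu, \sigma^2 \mid y) = p(\mu \mid \sigma^2, y)\,p(\sigma^2 \mid y)$

$p(\sigma^2 \mid y)$ requires averaging the joint distribution

$$p(\mu, \sigma^2 \mid y) \propto \sigma^{-n-2}\exp\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right)$$

over $\mu$, that is, evaluating the simple normal integral

$$\int \exp\left(-\frac{1}{2\sigma^2}\,n(\bar{y} - \mu)^2\right)d\mu = \sqrt{\frac{2\pi\sigma^2}{n}},$$

thus,

$$p(\sigma^2 \mid y) \propto (\sigma^2)^{-(n+1)/2}\exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right),$$

that is, $\sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1, s^2)$, a scaled inverse chi-squared distribution.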

(15)

Analytic form of marginal posterior distribution of µ

$\mu$ is typically the estimand of interest, so the ultimate objective of the Bayesian analysis is the marginal posterior distribution of $\mu$. This can be obtained by integrating $\sigma^2$ out of the joint posterior distribution. It is easily done by simulation: first draw $\sigma^2$ from $p(\sigma^2 \mid y)$, then draw $\mu$ from $p(\mu \mid \sigma^2, y)$.

The posterior distribution of $\mu$, $p(\mu \mid y)$, can be thought of as a mixture of normal distributions mixed over the scaled inverse chi-squared distribution for the variance, a rare case where analytic results are available.
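The two-step simulation described above can be sketched with the Python standard library alone. The data values are hypothetical, and the function name is made up for this example. A draw from Inv-χ²(n − 1, s²) is obtained as (n − 1)s² divided by a χ² draw with n − 1 degrees of freedom, which in turn is a gamma draw with shape (n − 1)/2 and scale 2.

```python
import math
import random

def draw_mu_sigma2(y, n_draws, seed=0):
    """Draw (mu, sigma2) pairs from the joint posterior under p(mu, sigma2) ∝ 1/sigma2."""
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    draws = []
    for _ in range(n_draws):
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)  # chi-squared with n-1 df
        sigma2 = (n - 1) * s2 / chi2                  # sigma2 | y ~ Inv-chi2(n-1, s2)
        mu = rng.gauss(ybar, math.sqrt(sigma2 / n))   # mu | sigma2, y ~ normal(ybar, sigma2/n)
        draws.append((mu, sigma2))
    return draws

y = [9.8, 10.4, 11.1, 9.5, 10.9, 10.2, 9.9, 10.6, 10.0, 10.7]  # hypothetical data
draws = draw_mu_sigma2(y, n_draws=50000)

# Keep the mu draws and ignore sigma2 to approximate the marginal posterior p(mu | y).
mu_draws = sorted(mu for mu, _ in draws)
lower = mu_draws[int(0.025 * len(mu_draws))]
upper = mu_draws[int(0.975 * len(mu_draws))]
print(sum(mu_draws) / len(mu_draws), (lower, upper))  # posterior mean and 95% interval
```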


(16)

Performing the integration

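We start by integrating the joint posterior density over $\sigma^2$:

$$p(\mu \mid y) = \int_0^\infty p(\mu, \sigma^2 \mid y)\,d\sigma^2$$

With the substitution $z = \frac{A}{2\sigma^2}$, where $A = (n-1)s^2 + n(\mu - \bar{y})^2$, the result is an unnormalized gamma integral:

$$\begin{aligned}
p(\mu \mid y) &\propto A^{-n/2}\int_0^\infty z^{(n-2)/2}\exp(-z)\,dz \\
&\propto \left[(n-1)s^2 + n(\mu - \bar{y})^2\right]^{-n/2} \\
&\propto \left[1 + \frac{n(\mu - \bar{y})^2}{(n-1)s^2}\right]^{-n/2}
\end{aligned}$$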
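As a sanity check on the integration, the joint posterior can be integrated over σ² numerically and compared with the closed form. The sketch below assumes hypothetical summary statistics (`n`, `ybar`, `s2`) and uses a simple trapezoid rule on a grid in u = log σ²; the grid limits and step count are ad hoc choices. Since both expressions are unnormalized, only ratios between two values of µ are compared.

```python
import math

def joint_unnormalized(mu, sigma2, n, ybar, s2):
    """sigma^(-n-2) * exp(-[(n-1)s^2 + n(ybar-mu)^2] / (2 sigma^2))."""
    a = (n - 1) * s2 + n * (ybar - mu) ** 2
    return sigma2 ** (-(n + 2) / 2.0) * math.exp(-a / (2.0 * sigma2))

def marginal_numeric(mu, n, ybar, s2, lo=-15.0, hi=15.0, steps=20000):
    """Integrate over sigma^2 with the trapezoid rule on u = log(sigma^2)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        sigma2 = math.exp(lo + i * h)
        # Jacobian of the change of variables: d(sigma^2) = sigma^2 du
        f = joint_unnormalized(mu, sigma2, n, ybar, s2) * sigma2
        total += f * (0.5 if i in (0, steps) else 1.0)
    return total * h

def marginal_closed_form(mu, n, ybar, s2):
    """[1 + n(mu - ybar)^2 / ((n-1) s^2)]^(-n/2), up to a constant."""
    return (1.0 + n * (mu - ybar) ** 2 / ((n - 1) * s2)) ** (-n / 2.0)

n, ybar, s2 = 10, 10.31, 0.2677  # hypothetical summary statistics
ratio_numeric = marginal_numeric(10.6, n, ybar, s2) / marginal_numeric(ybar, n, ybar, s2)
ratio_closed = marginal_closed_form(10.6, n, ybar, s2) / marginal_closed_form(ybar, n, ybar, s2)
print(ratio_numeric, ratio_closed)  # the two ratios should agree closely
```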

(17)

Parallel between Bayesian & Frequentist results

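$\sigma^2$: Bayes (under noninformative prior on $\log\sigma^2$, $p(\sigma^2) \propto (\sigma^2)^{-1}$) versus Frequentist:

$$\frac{(n-1)s^2}{\sigma^2}\,\Big|\,y \sim \chi^2_{n-1} \quad\text{vs.}\quad \frac{(n-1)s^2}{\sigma^2}\,\Big|\,\mu, \sigma^2 \sim \chi^2_{n-1}$$

$\mu$: Bayes (under noninformative prior on $(\mu, \log\sigma^2)$, $p(\mu, \sigma^2) \propto (\sigma^2)^{-1}$) versus Frequentist:

$$\frac{\mu - \bar{y}}{s/\sqrt{n}}\,\Big|\,y \sim t_{n-1} \quad\text{vs.}\quad \frac{\bar{y} - \mu}{s/\sqrt{n}}\,\Big|\,\mu, \sigma^2 \sim t_{n-1},$$

where the ratio $\frac{\bar{y} - \mu}{s/\sqrt{n}}$ is called a pivotal quantity: its sampling distribution does not depend on the nuisance parameter $\sigma^2$.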
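One way to see why the ratio is pivotal: writing y_i = µ + σ z_i with standard normal z_i, the factor σ cancels between numerator and denominator, so the ratio is a function of the z_i alone. The sketch below illustrates this with a fixed set of simulated z values and a few arbitrary choices of σ; all names and numbers are illustrative.

```python
import math
import random

def pivot(y, mu):
    """(ybar - mu) / (s / sqrt(n)) for data y and true mean mu."""
    n = len(y)
    ybar = sum(y) / n
    s = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return (ybar - mu) / (s / math.sqrt(n))

rng = random.Random(3)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # shared standard normal noise
mu = 2.0                                     # arbitrary true mean

# Same noise, different scales: the pivot is unchanged up to rounding.
pivots = {sigma: pivot([mu + sigma * zi for zi in z], mu) for sigma in (0.5, 1.0, 10.0)}
print(pivots)
```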

