
(1)

Introduction to Bayesian Statistics

Lecture 9: Hierarchical Models

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

May 6, 2015

(2)

Example

Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).

         Day 8   Day 15   Day 22   Day 29   Day 36
Rat 1     151     199      246      283      320
Rat 2     145     199      249      293      354
· · ·
Rat 30    153     200      244      286      324

Model:

$$Y_{ij} = \alpha + \beta x_j + \epsilon_{ij}, \qquad \epsilon_{ij} \sim \text{Normal}(0, \sigma^2),$$

where Y_ij is the weight of the i-th rat on day x_j.

What does this model assume about the growth of the 30 rats?
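To make the pooled assumption concrete, here is a minimal sketch (mine, not the lecture's) that fits one common (α, β) to all rats by ordinary least squares, using only the three rows of the table shown above; it is the classical analogue of assuming every rat shares the same growth line.

```python
import numpy as np

# Measurement days and weights for the three rats shown in the table above.
days = np.array([8, 15, 22, 29, 36])
weights = np.array([
    [151, 199, 246, 283, 320],   # Rat 1
    [145, 199, 249, 293, 354],   # Rat 2
    [153, 200, 244, 286, 324],   # Rat 30
])

# One common intercept alpha and slope beta for every rat: stack all
# (day, weight) pairs and solve a single least-squares problem.
X = np.column_stack([np.ones(days.size), days])   # design matrix [1, x_j]
X_all = np.tile(X, (weights.shape[0], 1))         # same design repeated per rat
y_all = weights.reshape(-1)
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
print(f"alpha = {alpha_hat:.1f} g, beta = {beta_hat:.2f} g/day")
```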


(3)

Example

Data: numbers of failures and lengths of operation time for 10 power plant pumps (George, Makov, & Smith, 1993).

Pump        1     2     3     4     5     6     7     8     9    10
time      94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1   10.5
failures    5     1     5    14    3    19     1     1     4    22

Model:

$$X_i \sim \text{Poisson}(\lambda t_i),$$

where X_i is the number of failures of pump i, λ is the common failure rate, and t_i is the length of operation time of pump i (in 1000s of hours).

What does this model assume about the failure rates of the 10 power plant pumps?
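As a quick sanity check on that assumption, a short sketch (not from the lecture, using the table above): under X_i ∼ Poisson(λ t_i) with one common rate, the maximum-likelihood estimate is total failures over total exposure, and the raw per-pump rates show how strained the common-λ assumption is.

```python
import numpy as np

# Operation times (in 1000s of hours) and failure counts for the 10 pumps.
t = np.array([94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
x = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])

# With a single common rate, the Poisson MLE is total failures / total time.
lam_hat = x.sum() / t.sum()
print(f"pooled rate: {lam_hat:.3f} failures per 1000 hours")

# The raw per-pump rates x_i / t_i vary by more than a factor of 30,
# which is what motivates letting each pump have its own lambda_i.
print(np.round(x / t, 2))
```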

(4)

Possible problems with above approaches

A single (α, β) may be inadequate to fit all the rats; likewise, a common failure rate λ may not be suitable for all the power plant pumps.

Fitting separate, unrelated (α_i, β_i) for each rat, or λ_i for each pump, is likely to overfit the data: some information about the parameters of one rat or pump can be obtained from the others' data.


(5)

Motivation for hierarchical models

A natural idea is to assume that the (α_i, β_i)'s or λ_i's are samples from a common population distribution. A model in which the distribution of the observed outcomes is specified conditional on parameters that themselves have a probability specification is known as a hierarchical (or multilevel) model.

The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.

Thus, we need only estimate the parameters governing the population distribution of the (α_i, β_i), rather than each (α_i, β_i) separately.

(6)

Bayesian approach to hierarchical models

Model specification

specify the sampling distribution of data: p(y |θ)

specify the population distribution of θ: p(θ|φ) where φ is the hyperparameter

Bayesian estimation

specify the prior for the hyperparameter: p(φ). Many levels are possible; the hyperprior distribution at the highest level is often chosen to be non-informative.

consider the above model specification: p(y |θ) and p(θ|φ)

find the joint posterior distribution of parameter θ and hyperparameter φ:

$$p(\theta, \phi \mid y) \propto p(\theta, \phi)\, p(y \mid \theta, \phi) = p(\theta, \phi)\, p(y \mid \theta) = p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$

Point and credible-interval estimates for φ and θ

Predictive distribution for ỹ

6 of 22

(7)

Analytical derivation of conditional/marginal dist.

Write out the joint posterior distribution:

$$p(\theta, \phi \mid y) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$

Determine analytically the conditional posterior density of θ given φ: p(θ|φ, y )

Obtain the marginal posterior distribution of φ:

$$p(\phi \mid y) = \int p(\theta, \phi \mid y)\, d\theta \qquad \text{or} \qquad p(\phi \mid y) = \frac{p(\theta, \phi \mid y)}{p(\theta \mid \phi, y)}.$$

(8)

Simulations from the posterior distributions

1. Two steps to simulate a random draw from the joint posterior distribution of θ and φ: p(θ, φ|y )

Draw φ from its marginal posterior distribution: p(φ|y )

Draw parameter θ from its conditional posterior p(θ|φ, y )

2. If desired, draw predictive values ỹ from the posterior predictive distribution given the drawn θ
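A skeletal sketch of this two-step scheme; the three draw functions are hypothetical placeholders that a concrete model (such as the rat-tumor example below) would fill in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders for a concrete model's simulation routines.
def draw_phi(rng):
    """Draw phi from its marginal posterior p(phi | y), e.g. by grid sampling."""
    raise NotImplementedError

def draw_theta(phi, rng):
    """Draw theta from its conditional posterior p(theta | phi, y)."""
    raise NotImplementedError

def draw_ytilde(theta, rng):
    """Draw a predictive value from p(y_tilde | theta)."""
    raise NotImplementedError

def joint_posterior_draw(rng):
    phi = draw_phi(rng)            # step 1a: marginal of the hyperparameter
    theta = draw_theta(phi, rng)   # step 1b: parameter given hyperparameter
    return phi, theta, draw_ytilde(theta, rng)  # step 2: predictive draw
```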


(9)

Example: Rat tumors

Goal: Estimating the risk of tumor in a group of rats

Data (number of rats that developed some kind of tumor):

1. 70 historical experiments:

0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19
0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19
1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20
2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20
3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20
4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20
6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24

2. Current experiment: 4/14

(10)

Bayesian approach to hierarchical models

Model specification

sampling distribution of data: y_j ∼ Binomial(n_j, θ_j), j = 1, 2, …, 71.

the population distribution of θ: θ_j ∼ Beta(α, β), where α and β are the hyperparameters.

Bayesian estimation

non-informative prior for hyperparameters: p(α, β)

consider the above model specification: p(θ|α, β)

find the joint posterior distribution of parameter θ and hyperparameters α and β:

$$p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta)$$

$$\propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j - y_j}$$


(11)

Analytical derivation of conditional/marginal dist.

the joint posterior distribution:

$$p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j - y_j}$$

the conditional posterior density of θ given α and β:

$$p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}\, \theta_j^{\alpha+y_j-1} (1-\theta_j)^{\beta+n_j-y_j-1}$$

the marginal posterior distribution of α and β:

$$p(\alpha, \beta \mid y) = \frac{p(\theta, \alpha, \beta \mid y)}{p(\theta \mid \alpha, \beta, y)} \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}{\Gamma(\alpha+\beta+n_j)}$$
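A sketch (mine, not the lecture's) that evaluates this unnormalized marginal posterior for the rat-tumor data; it uses `scipy.special.gammaln` for numerical stability and already plugs in the hyperprior p(α, β) ∝ (α + β)^(-5/2) adopted on the next slide.

```python
import numpy as np
from scipy.special import gammaln

# All 71 (y_j, n_j) pairs: the 70 historical experiments plus the current 4/14.
y = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,
              2,2,2,2,2,2,2,2,2,1,5,2,5,3,2,7,7,3,3,2,9,10,
              4,4,4,4,4,4,4,10,4,4,4,5,11,12,5,5,6,5,6,6,6,6,
              16,15,15,9,4])
n = np.array([20,20,20,20,20,20,20,19,19,19,19,18,18,17,20,20,20,20,19,19,18,18,
              25,24,23,20,20,20,20,20,20,10,49,19,46,27,17,49,47,20,20,13,48,50,
              20,20,20,20,20,20,20,48,19,19,19,22,46,49,20,20,23,19,22,20,20,20,
              52,47,46,24,14])

def log_post_ab(alpha, beta):
    """Unnormalized log p(alpha, beta | y), with the hyperprior
    p(alpha, beta) proportional to (alpha + beta)^(-5/2)."""
    return (-2.5 * np.log(alpha + beta)
            + np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                     + gammaln(alpha + y) + gammaln(beta + n - y)
                     - gammaln(alpha + beta + n)))
```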

(12)

Choice of hyperprior distribution

Idea: set up a 'non-informative' hyperprior distribution.

$$p\left(\operatorname{logit}\left(\tfrac{\alpha}{\alpha+\beta}\right), \log(\alpha+\beta)\right) = p\left(\log\left(\tfrac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \propto 1$$

NO GOOD, because it leads to an improper posterior.

$$p\left(\tfrac{\alpha}{\alpha+\beta},\, \alpha+\beta\right) \propto 1 \qquad \text{or} \qquad p(\alpha, \beta) \propto 1$$

NO GOOD, because the posterior density is not integrable in the limit.

$$p\left(\tfrac{\alpha}{\alpha+\beta},\, (\alpha+\beta)^{-1/2}\right) \propto 1 \iff p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2} \iff p\left(\log\left(\tfrac{\alpha}{\beta}\right), \log(\alpha+\beta)\right) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}$$

OK, because it leads to a proper posterior.
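The equivalence of the last two forms is just the change-of-variables formula; writing u = log(α/β) and v = log(α + β) (a verification step added here, not on the slide):

$$\left|\frac{\partial(\alpha,\beta)}{\partial(u,v)}\right| = (\alpha+\beta)^2 \cdot \frac{\alpha}{\alpha+\beta} \cdot \frac{\beta}{\alpha+\beta} = \alpha\beta, \qquad\text{so}\qquad p(u, v) = p(\alpha, \beta)\,\alpha\beta \propto \alpha\beta\,(\alpha+\beta)^{-5/2}.$$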


(13)

Computing marginal posterior of the hyperparameters

Computing the relative (unnormalized) posterior density on a grid of values covering the effective range of (α, β):

$$\left(\log(\alpha/\beta),\, \log(\alpha+\beta)\right) \in [-2.5, -1] \times [1.5, 3], \qquad \text{then} \qquad [-2.3, -1.3] \times [1, 5].$$

Drawing a contour plot of the marginal density of (log(α/β), log(α + β)); contour lines are at 0.05, 0.15, …, 0.95 times the density at the mode.

Normalizing by approximating the posterior distribution as a step function over the grid and setting the total probability in the grid to 1.

Computing the posterior moments based on the grid of (log(α/β), log(α + β)). For example, E(α|y) is estimated by

$$\mathrm{E}(\alpha \mid y) \approx \sum_{\text{grid}} \alpha \; p\left(\log(\alpha/\beta),\, \log(\alpha+\beta) \,\middle|\, y\right).$$
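Continuing the earlier sketch (reusing `y`, `n`, and `log_post_ab`), the grid computation, normalization, and posterior-moment steps might look like this; the 100 × 100 grid resolution is my choice, not the lecture's.

```python
# Grid over (u, v) = (log(alpha/beta), log(alpha+beta)), using the slide's range.
u = np.linspace(-2.3, -1.3, 100)
v = np.linspace(1.0, 5.0, 100)
U, V = np.meshgrid(u, v, indexing="ij")
S = np.exp(V)                       # alpha + beta
R = 1.0 / (1.0 + np.exp(-U))        # alpha / (alpha + beta)
A, B = R * S, (1.0 - R) * S         # alpha and beta at each grid point

# The density in (u, v) coordinates carries the Jacobian factor alpha * beta.
logp = np.log(A) + np.log(B) + np.vectorize(log_post_ab)(A, B)

# Normalize as a step function over the grid (total probability 1).
p = np.exp(logp - logp.max())
p /= p.sum()

# Posterior moments as sums over the grid, e.g. E(alpha | y):
E_alpha = np.sum(A * p)
print(f"E(alpha | y) is approximately {E_alpha:.2f}")
```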

(14)

Sampling from the joint posterior

1. Simulation 1000 draws of (log(αβ), log(α + β)) from their posterior distribution using the discrete-grid sampling procedure.

2. For l = 1, · · · , 1000

Transform the l -th draw of (log(αβ), log(α + β)) to the scale of (α, β) to yield a draw of the hyperparameters from their marginal posterior distribution.

For each j = 1, · · · , J, sample θj from its conditional posterior distribution θj|α, β, y ∼ Beta(α + yj, β + nj− yj).
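Continuing from the grid above, the two steps translate directly into code (a sketch; within-cell jitter of the grid draws, often added in practice, is omitted here).

```python
# Step 1: draw 1000 grid cells with probability proportional to p, then map
# each cell back to the (alpha, beta) scale.
rng = np.random.default_rng(1)
idx = rng.choice(p.size, size=1000, p=p.ravel())
alpha_d, beta_d = A.ravel()[idx], B.ravel()[idx]

# Step 2: sample every theta_j from Beta(alpha + y_j, beta + n_j - y_j),
# one row per posterior draw of the hyperparameters.
theta = rng.beta(alpha_d[:, None] + y[None, :],
                 beta_d[:, None] + (n - y)[None, :])   # shape (1000, 71)
```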


(15)

Displaying the results

Plot the posterior means and 95% intervals for the θ_j's (Figure 5.4 on page 131).

The rates θ_j are shrunk from their sample point estimates, y_j/n_j, towards the mean of the population distribution.

Experiments with few observations are shrunk more and have higher posterior variances.

Note that posterior variability is higher in the full Bayesian analysis, reflecting posterior uncertainty in the hyperparameters.

(16)

Hierarchical normal models (I)

Model specification

Sampling distribution of data:

$$y_{ij} \mid \theta_j \sim \text{Normal}(\theta_j, \sigma^2), \quad i = 1, \ldots, n_j, \; j = 1, 2, \ldots, J, \qquad \sigma^2 \text{ known}$$

the population distribution of θ: θ_j ∼ Normal(µ, τ²), where µ and τ are the hyperparameters. That is,

$$p(\theta_1, \ldots, \theta_J \mid \mu, \tau) = \prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2),$$

$$p(\theta_1, \ldots, \theta_J) = \int \prod_{j=1}^{J} \left[ \mathrm{N}(\theta_j \mid \mu, \tau^2) \right] p(\mu, \tau)\, d(\mu, \tau).$$


(17)

Hierarchical normal models (II)

Bayesian estimation

non-informative prior for hyperparameters:

p(µ, τ ) = p(µ|τ )p(τ ) ∝ p(τ )

consider the above model specification: p(θ|µ, τ )

find the joint posterior distribution of parameter θ and hyperparameters µ and τ :

$$p(\theta, \mu, \tau \mid y) \propto p(\mu, \tau)\, p(\theta \mid \mu, \tau)\, p(y \mid \theta) \propto p(\mu, \tau) \prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{\cdot j} \,\middle|\, \theta_j,\, \sigma^2/n_j\right)$$

(18)

Conditional posterior of θ given (µ, τ ), p(θ|µ, τ, y)

$$\theta_j \mid \mu, \tau \sim \text{Normal}(\mu, \tau^2),$$

$$\theta_j \mid \mu, \tau, y \sim \text{Normal}(\hat\theta_j, V_j), \quad \text{where} \quad \hat\theta_j = \frac{\frac{n_j}{\sigma^2}\,\bar{y}_{\cdot j} + \frac{1}{\tau^2}\,\mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}, \qquad V_j = \frac{1}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}.$$
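In code, the precision-weighted form is a one-liner; a minimal sketch (function and argument names are mine):

```python
def theta_conditional(ybar_j, n_j, sigma2, mu, tau2):
    """Posterior mean and variance of theta_j given (mu, tau): a
    precision-weighted average of the group mean and the population mean."""
    prec = n_j / sigma2 + 1.0 / tau2   # total precision = data + prior precision
    V_j = 1.0 / prec
    theta_hat = ((n_j / sigma2) * ybar_j + mu / tau2) * V_j
    return theta_hat, V_j
```

As τ → ∞ the estimate approaches the group mean ȳ_.j, and as τ → 0 it collapses to the population mean µ.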


(19)

Marginal posterior of µ and τ , p(µ, τ |y)

$$p(\mu, \tau \mid y) \propto p(\mu, \tau)\, p(y \mid \mu, \tau), \qquad \bar{y}_{\cdot j} \mid \mu, \tau \sim \text{Normal}\!\left(\mu,\; \frac{\sigma^2}{n_j} + \tau^2\right).$$

Therefore,

$$p(\mu, \tau \mid y) \propto p(\mu, \tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{\cdot j} \,\middle|\, \mu,\; \frac{\sigma^2}{n_j} + \tau^2\right).$$

(20)

Posterior of µ given τ , p(µ|τ, y)

$$p(\mu, \tau \mid y) = p(\mu \mid \tau, y)\, p(\tau \mid y) \;\Rightarrow\; p(\mu \mid \tau, y) = \frac{p(\mu, \tau \mid y)}{p(\tau \mid y)}.$$

Therefore,

$$\mu \mid \tau, y \sim \text{Normal}(\hat\mu, V_\mu), \quad \text{where} \quad \hat\mu = \frac{\sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}\, \bar{y}_{\cdot j}}{\sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}} \qquad \text{and} \qquad V_\mu^{-1} = \sum_{j=1}^{J} \frac{1}{\sigma^2/n_j + \tau^2}.$$
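The same precision-weighting applies one level up; a minimal sketch (function name mine), vectorized over the J groups:

```python
import numpy as np

def mu_conditional(ybar, n, sigma2, tau):
    """Posterior mean and variance of mu given tau: group means ybar_.j
    weighted by the precisions 1 / (sigma2/n_j + tau^2)."""
    w = 1.0 / (sigma2 / n + tau**2)
    return np.sum(w * ybar) / np.sum(w), 1.0 / np.sum(w)
```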


(21)

Posterior distribution of τ , p(τ |y)

$$p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{\cdot j} \mid \mu,\, \sigma^2/n_j + \tau^2\right)}{\mathrm{N}(\mu \mid \hat\mu, V_\mu)}$$

The identity holds for any value of µ; evaluating it at µ = µ̂ gives

$$\propto \frac{p(\tau) \prod_{j=1}^{J} \mathrm{N}\!\left(\bar{y}_{\cdot j} \mid \hat\mu,\, \sigma^2/n_j + \tau^2\right)}{\mathrm{N}(\hat\mu \mid \hat\mu, V_\mu)} \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\!\left(-\frac{(\bar{y}_{\cdot j} - \hat\mu)^2}{2\left(\sigma^2/n_j + \tau^2\right)}\right).$$
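Putting the pieces together, p(τ | y) can be evaluated on a one-dimensional grid; a sketch assuming hypothetical arrays `ybar` (group means) and `n` (group sizes) and a known `sigma2`, with a flat p(τ) as the default prior:

```python
import numpy as np

def log_tau_post(tau, ybar, n, sigma2, log_prior=lambda t: 0.0):
    """Unnormalized log p(tau | y); the default log_prior is a flat p(tau)."""
    w = 1.0 / (sigma2 / n + tau**2)          # 1 / (sigma^2/n_j + tau^2)
    mu_hat = np.sum(w * ybar) / np.sum(w)    # conditional posterior mean of mu
    V_mu = 1.0 / np.sum(w)
    return (log_prior(tau) + 0.5 * np.log(V_mu)
            + np.sum(0.5 * np.log(w) - 0.5 * w * (ybar - mu_hat) ** 2))

# Evaluate on a grid and normalize as a step function, as with (alpha, beta):
# taus = np.linspace(1e-3, 10.0, 400)
# logp = np.array([log_tau_post(t, ybar, n, sigma2) for t in taus])
```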

(22)

Prior distribution of τ , p(τ )

The prior p(τ) enters the marginal posterior directly as a factor,

$$p(\tau \mid y) \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\!\left(-\frac{(\bar{y}_{\cdot j} - \hat\mu)^2}{2\left(\sigma^2/n_j + \tau^2\right)}\right),$$

so whether p(τ | y) is proper depends on the choice of p(τ).

