Introduction to Bayesian Statistics
Lecture 9: Hierarchical Models
Rung-Ching Tsai
Department of Mathematics National Taiwan Normal University
May 6, 2015
Example
• Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).
Day      8    15   22   29   36
Rat 1    151  199  246  283  320
Rat 2    145  199  249  293  354
· · ·
Rat 30   153  200  244  286  324
• Model:
Yij = α + βxj + εij,
where Yij is the weight of the i-th rat on day xj, and εij ∼ Normal(0, σ²)
• What is the assumption on the growth of the 30 rats in this model?
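The common-line assumption can be checked by fitting the pooled model with ordinary least squares to the rats shown in the table (a minimal sketch; variable names are illustrative and only the three tabulated rats are used, not all 30):

```python
# Ordinary least squares for the pooled model Y = alpha + beta*x, stacking
# the three rats shown above (illustrative subset of the 30 rats).
days = [8, 15, 22, 29, 36]
rats = {
    1:  [151, 199, 246, 283, 320],
    2:  [145, 199, 249, 293, 354],
    30: [153, 200, 244, 286, 324],
}

# Stack (day, weight) pairs across rats: one (alpha, beta) for every rat.
xs = [d for _ in rats for d in days]
ys = [w for weights in rats.values() for w in weights]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))
alpha = ybar - beta * xbar
# The fitted line forces identical growth on every rat; per-rat residuals
# would reveal whether that assumption is adequate.
```
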
Example
• Data: Number of Failures and length of operation time of 10 power plant pumps (George, Makov, & Smith, 1993).
Pump      1     2     3     4     5     6     7     8     9     10
time     94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1   10.5
failure   5     1     5    14     3    19     1     1     4    22
• Model:
Xi ∼ Poisson(λti),
where Xi is the number of failures of pump i, λ is the common failure rate, and ti is the length of operation time of pump i (in 1000s of hours).
• What is the assumption on the failure rates of the 10 power plant pumps in this model?
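Under this common-rate model the maximum-likelihood estimate of λ is total failures divided by total operating time; a minimal sketch using the table above (variable names are illustrative):

```python
# Pooled failure-rate estimate under the common-lambda Poisson model.
t = [94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]  # 1000s of hours
x = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]                           # failures

# With X_i ~ Poisson(lambda * t_i), the MLE of the common rate is
# total failures divided by total exposure.
lam_hat = sum(x) / sum(t)

# Pump-specific raw rates; their wide spread hints that a single lambda
# may not fit all pumps.
per_pump = [xi / ti for xi, ti in zip(x, t)]
```
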
Possible problems with above approaches
• A single (α, β) may be inadequate to fit all the rats. Likewise, a common failure rate for all the power plant pumps may not be suitable.
• Fitting separate, unrelated (αi, βi)'s for each rat, or λi's for each pump, is likely to overfit the data; moreover, information about the parameters of one rat or pump can be obtained from the others' data.
Motivation for hierarchical models
• A natural idea is to assume that the (αi, βi)'s or λi's are samples from a common population distribution. A model in which the distribution of the observed outcomes is conditional on parameters that themselves have a probability specification is known as a hierarchical or multilevel model.
• The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.
• Thus, we would need to estimate the parameters governing the population distribution of (αi, βi) rather than each (αi, βi) separately.
Bayesian approach to hierarchical models
• Model specification
◦ specify the sampling distribution of data: p(y |θ)
◦ specify the population distribution of θ: p(θ|φ) where φ is the hyperparameter
• Bayesian estimation
◦ specify the prior for the hyperparameter: p(φ). Many levels are possible; the hyperprior distribution at the highest level is often chosen to be non-informative.
◦ consider the above model specification: p(y |θ) and p(θ|φ)
◦ find the joint posterior distribution of parameter θ and hyperparameter φ:
p(θ, φ|y) ∝ p(θ, φ) p(y|θ, φ) = p(θ, φ) p(y|θ)
∝ p(φ) p(θ|φ) p(y|θ)
◦ Point and credible-interval estimation for φ and θ
◦ Predictive distribution for ỹ
Analytical derivation of conditional/marginal dist.
• Write out the joint posterior distribution:
p(θ, φ|y) ∝ p(φ) p(θ|φ) p(y|θ)
• Determine analytically the conditional posterior density of θ given φ: p(θ|φ, y )
• Obtain the marginal posterior distribution of φ:
p(φ|y) = ∫ p(θ, φ|y) dθ, or
p(φ|y) = p(θ, φ|y) / p(θ|φ, y).
Simulations from the posterior distributions
1. Two steps to simulate a random draw from the joint posterior distribution of θ and φ: p(θ, φ|y )
◦ Draw φ from its marginal posterior distribution: p(φ|y )
◦ Draw parameter θ from its conditional posterior p(θ|φ, y )
2. If desired, draw predictive values ỹ from the posterior predictive distribution given the drawn θ
Example: Rat tumors
• Goal: Estimating the risk of tumor in a group of rats
• Data (number of rats that developed a tumor, out of the number of rats in each experiment):
1. 70 historical experiments:
0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19
0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19
1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20
2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20
3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20
4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20
6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24
2. Current experiment: 4/14
Bayesian approach to hierarchical models
• Model specification
◦ sampling distribution of data: yj∼ binomial(nj, θj), j = 1, 2, · · · , 71.
◦ the population distribution of θ: θj ∼ Beta(α, β) where α and β are the hyperparameters.
• Bayesian estimation
◦ non-informative prior for hyperparameters: p(α, β)
◦ consider the above model specification: p(θ|α, β)
◦ find the joint posterior distribution of parameter θ and hyperparameters α and β:
p(θ, α, β|y) ∝ p(α, β)p(θ|α, β)p(y|θ, α, β)
∝ p(α, β) ∏_{j=1}^{J} [Γ(α+β) / (Γ(α)Γ(β))] θj^(α−1) (1−θj)^(β−1) · ∏_{j=1}^{J} θj^(yj) (1−θj)^(nj−yj)
Analytical derivation of conditional/marginal dist.
• the joint posterior distribution:
p(θ, α, β|y) ∝ p(α, β) ∏_{j=1}^{J} [Γ(α+β) / (Γ(α)Γ(β))] θj^(α−1) (1−θj)^(β−1) · ∏_{j=1}^{J} θj^(yj) (1−θj)^(nj−yj)
• the conditional posterior density of θ given α and β:
p(θ|α, β, y) = ∏_{j=1}^{J} [Γ(α+β+nj) / (Γ(α+yj) Γ(β+nj−yj))] θj^(α+yj−1) (1−θj)^(β+nj−yj−1)
• the marginal posterior distribution of α and β:
p(α, β|y) = p(θ, α, β|y) / p(θ|α, β, y) ∝ p(α, β) ∏_{j=1}^{J} [Γ(α+β) / (Γ(α)Γ(β))] · [Γ(α+yj) Γ(β+nj−yj) / Γ(α+β+nj)]
Choice of hyperprior distribution
• Idea: set up a 'non-informative' hyperprior distribution
◦ p(logit(α/(α+β)), log(α+β)) = p(log(α/β), log(α+β)) ∝ 1
NO GOOD, because it leads to an improper posterior.
◦ p(α/(α+β), α+β) ∝ 1, or p(α, β) ∝ 1
NO GOOD, because the posterior density is not integrable in the limit.
◦ p(α/(α+β), (α+β)^(−1/2)) ∝ 1 ⇐⇒ p(α, β) ∝ (α+β)^(−5/2)
⇐⇒ p(log(α/β), log(α+β)) ∝ αβ(α+β)^(−5/2)
OK, because it leads to a proper posterior.
Computing marginal posterior of the hyperparameters
• Computing the relative (unnormalized) posterior density on a grid of values that covers the effective range of (α, β):
◦ first grid: (log(α/β), log(α+β)) ∈ [−2.5, −1] × [1.5, 3]
◦ refined grid: (log(α/β), log(α+β)) ∈ [−2.3, −1.3] × [1, 5]
• Drawing a contour plot of the marginal density of (log(α/β), log(α+β)):
◦ contour lines are at 0.05, 0.15, · · · , 0.95 times the density at the mode.
• Normalizing by approximating the posterior distribution as a step function over the grid and setting the total probability in the grid to 1.
• Computing the posterior moments from the grid of (log(α/β), log(α+β)). For example, E(α|y) is estimated by
Σ α · p(log(α/β), log(α+β) | y), summing over the grid points.
Sampling from the joint posterior
1. Simulate 1000 draws of (log(α/β), log(α+β)) from their posterior distribution using the discrete-grid sampling procedure.
2. For l = 1, · · · , 1000
◦ Transform the l -th draw of (log(αβ), log(α + β)) to the scale of (α, β) to yield a draw of the hyperparameters from their marginal posterior distribution.
◦ For each j = 1, · · · , J, sample θj from its conditional posterior distribution θj|α, β, y ∼ Beta(α + yj, β + nj− yj).
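The grid computation and the two sampling steps above can be sketched as follows. This is an illustrative sketch only: it uses a small subset of the 71 experiments (the full analysis is identical, just with more data), and the grid resolution is an arbitrary choice.

```python
# Discrete-grid sampling for the rat-tumor hierarchical model on an
# illustrative subset of the data.
import math
import random

random.seed(1)

y = [0, 0, 1, 2, 2, 4, 5, 6, 9, 4]           # tumors (last entry: current 4/14)
n = [20, 19, 20, 25, 20, 20, 22, 23, 24, 14]  # rats per experiment

def log_post(u, v):
    # Unnormalized log p(u, v | y), u = log(alpha/beta), v = log(alpha+beta);
    # the hyperprior p(alpha, beta) ∝ (alpha+beta)^(-5/2) becomes
    # ∝ alpha*beta*(alpha+beta)^(-5/2) on the (u, v) scale.
    s = math.exp(v)                       # alpha + beta
    a = s / (1.0 + math.exp(-u))          # alpha
    b = s - a                             # beta
    lp = math.log(a) + math.log(b) - 2.5 * v
    for yj, nj in zip(y, n):
        lp += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
               + math.lgamma(a + yj) + math.lgamma(b + nj - yj)
               - math.lgamma(a + b + nj))
    return lp

# Posterior on the refined grid, normalized as a step function.
us = [-2.3 + 0.02 * i for i in range(51)]
vs = [1.0 + 0.08 * i for i in range(51)]
grid = [(u, v) for u in us for v in vs]
lps = [log_post(u, v) for u, v in grid]
top = max(lps)                            # subtract max before exponentiating
w = [math.exp(lp - top) for lp in lps]
total = sum(w)
probs = [wi / total for wi in w]

# Step 1: draw (u, v) from the grid and transform to (alpha, beta);
# Step 2: draw each theta_j from Beta(alpha + y_j, beta + n_j - y_j).
draws = random.choices(grid, weights=probs, k=1000)
thetas = []
for u, v in draws:
    s = math.exp(v)
    a = s / (1.0 + math.exp(-u))
    b = s - a
    thetas.append([random.betavariate(a + yj, b + nj - yj)
                   for yj, nj in zip(y, n)])
```

The posterior draws for the current experiment (last component) are shrunk from the raw rate 4/14 toward the population mean, which is the behavior discussed on the next slide.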
Displaying the results
• Plot the posterior means and 95% intervals for the θj’s (Figure 5.4 on page 131)
• The rates θj are shrunk from their sample point estimates, yj/nj, towards the approximate mean of the population distribution.
• Experiments with few observations are shrunk more and have higher posterior variances.
• Note that posterior variability is higher in the full Bayesian analysis, reflecting posterior uncertainty in the hyperparameters.
Hierarchical normal models (I)
• Model specification
◦ Sampling distribution of data:
yij|θj ∼ Normal(θj, σ²), i = 1, · · · , nj, j = 1, 2, · · · , J, with σ² known
◦ the population distribution of θ: θj ∼ Normal(µ, τ²), where µ and τ are the hyperparameters. That is,
p(θ1, · · · , θJ|µ, τ) = ∏_{j=1}^{J} N(θj|µ, τ²)
◦ marginally, averaging over the hyperparameters,
p(θ1, · · · , θJ) = ∫ ∏_{j=1}^{J} N(θj|µ, τ²) p(µ, τ) d(µ, τ).
Hierarchical normal models (II)
• Bayesian estimation
◦ non-informative prior for hyperparameters:
p(µ, τ ) = p(µ|τ )p(τ ) ∝ p(τ )
◦ consider the above model specification: p(θ|µ, τ )
◦ find the joint posterior distribution of parameter θ and hyperparameters µ and τ :
p(θ, µ, τ|y) ∝ p(µ, τ) p(θ|µ, τ) p(y|θ)
∝ p(µ, τ) ∏_{j=1}^{J} N(θj|µ, τ²) · ∏_{j=1}^{J} N(ȳ.j|θj, σ²/nj)
Conditional posterior of θ given (µ, τ ), p(θ|µ, τ, y)
• θj|µ, τ ∼ Normal(µ, τ²)
• θj|µ, τ, y ∼ Normal(θ̂j, Vj), where
◦ θ̂j = [(nj/σ²) ȳ.j + (1/τ²) µ] / (nj/σ² + 1/τ²)
◦ Vj = 1 / (nj/σ² + 1/τ²)
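The precision-weighted form of θ̂j and Vj above can be written as a small helper (a sketch; the numeric inputs are made up for illustration):

```python
# Conditional posterior of theta_j given the hyperparameters: the posterior
# mean is a precision-weighted average of the group mean and mu.
def conditional_posterior(ybar_j, n_j, sigma2, mu, tau2):
    """Return (theta_hat_j, V_j) for theta_j | mu, tau, y."""
    prec_data = n_j / sigma2        # precision of ybar_j given theta_j
    prec_prior = 1.0 / tau2         # precision from the population distribution
    V_j = 1.0 / (prec_data + prec_prior)
    theta_hat_j = V_j * (prec_data * ybar_j + prec_prior * mu)
    return theta_hat_j, V_j

# With equal precisions (both 0.25 here), the posterior mean is the
# midpoint of ybar_j = 10 and mu = 6.
theta_hat, V = conditional_posterior(ybar_j=10.0, n_j=4, sigma2=16.0,
                                     mu=6.0, tau2=4.0)
```
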
Marginal posterior of µ and τ , p(µ, τ |y)
p(µ, τ|y) ∝ p(µ, τ) p(y|µ, τ)
Since ȳ.j|µ, τ ∼ Normal(µ, σ²/nj + τ²), therefore
p(µ, τ|y) ∝ p(µ, τ) ∏_{j=1}^{J} N(ȳ.j|µ, σ²/nj + τ²)
Posterior of µ given τ , p(µ|τ, y)
p(µ, τ|y) = p(µ|τ, y) p(τ|y)
⇒ p(µ|τ, y) = p(µ, τ|y) / p(τ|y). Therefore,
µ|τ, y ∼ Normal(µ̂, Vµ), where
µ̂ = [Σ_{j=1}^{J} ȳ.j / (σ²/nj + τ²)] / [Σ_{j=1}^{J} 1 / (σ²/nj + τ²)]
and Vµ⁻¹ = Σ_{j=1}^{J} 1 / (σ²/nj + τ²)
Posterior distribution of τ , p(τ |y)
p(τ|y) = p(µ, τ|y) / p(µ|τ, y)
∝ p(τ) ∏_{j=1}^{J} N(ȳ.j|µ, σ²/nj + τ²) / N(µ|µ̂, Vµ)
∝ p(τ) ∏_{j=1}^{J} N(ȳ.j|µ̂, σ²/nj + τ²) / N(µ̂|µ̂, Vµ)
∝ p(τ) Vµ^(1/2) ∏_{j=1}^{J} (σ²/nj + τ²)^(−1/2) exp(−(ȳ.j − µ̂)² / (2(σ²/nj + τ²)))
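The τ → µ → θ simulation chain implied by the derivations above can be sketched as follows. The group means, group sizes, and σ are made-up illustrative values, and a uniform hyperprior p(τ) ∝ 1 is assumed:

```python
# Simulate from p(tau|y), then mu | tau, y, then each theta_j | mu, tau, y,
# for a hierarchical normal model with known sigma^2.
import math
import random

random.seed(2)

ybar = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]  # illustrative group means
nj = [30] * len(ybar)                                  # group sizes
sigma2 = 15.0 ** 2                                     # known sigma^2

def mu_hat_and_V(tau2):
    # mu | tau, y ~ Normal(mu_hat, V_mu) with precision-weighted mean.
    prec = [1.0 / (sigma2 / n + tau2) for n in nj]
    V_mu = 1.0 / sum(prec)
    mu_hat = V_mu * sum(p * yb for p, yb in zip(prec, ybar))
    return mu_hat, V_mu

def log_p_tau(tau):
    # log p(tau | y) up to a constant, using the final expression above,
    # with p(tau) ∝ 1.
    tau2 = tau * tau
    mu_hat, V_mu = mu_hat_and_V(tau2)
    lp = 0.5 * math.log(V_mu)
    for yb, n in zip(ybar, nj):
        s2 = sigma2 / n + tau2
        lp += -0.5 * math.log(s2) - (yb - mu_hat) ** 2 / (2.0 * s2)
    return lp

taus = [0.01 + 0.3 * i for i in range(100)]        # grid for tau
lws = [log_p_tau(t) for t in taus]
top = max(lws)
w = [math.exp(lp - top) for lp in lws]             # unnormalized grid weights

draws = []
for _ in range(1000):
    tau = random.choices(taus, weights=w, k=1)[0]  # tau ~ p(tau|y) on the grid
    mu_hat, V_mu = mu_hat_and_V(tau * tau)
    mu = random.gauss(mu_hat, math.sqrt(V_mu))     # mu | tau, y
    theta = []
    for yb, n in zip(ybar, nj):
        prec_d, prec_p = n / sigma2, 1.0 / tau ** 2
        V_j = 1.0 / (prec_d + prec_p)
        theta.append(random.gauss(V_j * (prec_d * yb + prec_p * mu),
                                  math.sqrt(V_j)))  # theta_j | mu, tau, y
    draws.append((tau, mu, theta))
```

Each element of `draws` is one joint draw of (τ, µ, θ1, …, θJ), exactly the two-step scheme of the earlier "Simulations from the posterior distributions" slide with φ = (µ, τ).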
Prior distribution of τ, p(τ)
• The expression for p(τ|y) above holds for any hyperprior p(τ); at this highest level a non-informative choice, such as the uniform p(τ) ∝ 1, is common (recall the specification p(µ, τ) = p(µ|τ)p(τ) ∝ p(τ)).