Introduction to Bayesian Statistics
Lecture 9: Hierarchical Models
Rung-Ching Tsai
Department of Mathematics National Taiwan Normal University
May 6, 2015
Example
• Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).
Day      8    15   22   29   36
Rat 1    151  199  246  283  320
Rat 2    145  199  249  293  354
· · ·
Rat 30   153  200  244  286  324
• Model:
Yij = α + βxj + εij,
where Yij is the weight of the i-th rat on day xj, and εij ∼ Normal(0, σ²)
• What is the assumption on the growth of the 30 rats in this model?
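The common-line assumption can be checked by fitting the pooled model with ordinary least squares to the rats shown in the table (a minimal sketch; variable names are illustrative and only the three tabulated rats are used, not all 30):

```python
# Ordinary least squares for the pooled model Y = alpha + beta*x, stacking
# the three rats shown above (illustrative subset of the 30 rats).
days = [8, 15, 22, 29, 36]
rats = {
    1:  [151, 199, 246, 283, 320],
    2:  [145, 199, 249, 293, 354],
    30: [153, 200, 244, 286, 324],
}

# Stack (day, weight) pairs across rats: one (alpha, beta) for every rat.
xs = [d for _ in rats for d in days]
ys = [w for weights in rats.values() for w in weights]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))
alpha = ybar - beta * xbar
# The fitted line forces identical growth on every rat; per-rat residuals
# would reveal whether that assumption is adequate.
```
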
Example
• Data: Number of Failures and length of operation time of 10 power plant pumps (George, Makov, & Smith, 1993).
Pump      1     2     3     4     5     6     7     8     9     10
time     94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1   10.5
failure   5     1     5    14     3    19     1     1     4    22
• Model:
Xi ∼ Poisson(λti),
where Xi is the number of failures of pump i, λ is the common failure rate, and ti is the length of operation time of pump i (in 1000s of hours).
• What is the assumption on the failure rates of the 10 power plant pumps in this model?
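Under this common-rate model the maximum-likelihood estimate of λ is total failures divided by total operating time; a minimal sketch using the table above (variable names are illustrative):

```python
# Pooled failure-rate estimate under the common-lambda Poisson model.
t = [94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]  # 1000s of hours
x = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]                           # failures

# With X_i ~ Poisson(lambda * t_i), the MLE of the common rate is
# total failures divided by total exposure.
lam_hat = sum(x) / sum(t)

# Pump-specific raw rates; their wide spread hints that a single lambda
# may not fit all pumps.
per_pump = [xi / ti for xi, ti in zip(x, t)]
```
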
Possible problems with above approaches
• A single (α, β) may be inadequate to fit all the rats. Likewise, a common failure rate for all the power plant pumps may not be suitable.
• Fitting separate, unrelated (αi, βi)'s for each rat, or λi's for each pump, is likely to overfit the data; moreover, information about the parameters of one rat or pump can be obtained from the others' data.
Motivation for hierarchical models
• A natural idea is to assume that the (αi, βi)'s or λi's are samples from a common population distribution. A model in which the distribution of the observed outcomes is conditional on parameters that themselves have a probability specification is known as a hierarchical or multilevel model.
• The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.
• Thus, we would need to estimate the parameters governing the population distribution of (αi, βi) rather than each (αi, βi) separately.
Bayesian approach to hierarchical models
• Model specification
◦ specify the sampling distribution of data: p(y |θ)
◦ specify the population distribution of θ: p(θ|φ) where φ is the hyperparameter
• Bayesian estimation
◦ specify the prior for the hyperparameter: p(φ). Many levels are possible; the hyperprior distribution at the highest level is often chosen to be non-informative.
◦ consider the above model specification: p(y |θ) and p(θ|φ)
◦ find the joint posterior distribution of parameter θ and hyperparameter φ:
p(θ, φ|y) ∝ p(θ, φ) p(y|θ, φ) = p(θ, φ) p(y|θ)
∝ p(φ) p(θ|φ) p(y|θ)
◦ Point and credible-interval estimation for φ and θ
◦ Predictive distribution for ỹ
Analytical derivation of conditional/marginal dist.
• Write out the joint posterior distribution:
p(θ, φ|y) ∝ p(φ) p(θ|φ) p(y|θ)
• Determine analytically the conditional posterior density of θ given φ: p(θ|φ, y )
• Obtain the marginal posterior distribution of φ:
p(φ|y) = ∫ p(θ, φ|y) dθ, or
p(φ|y) = p(θ, φ|y) / p(θ|φ, y).
Simulations from the posterior distributions
1. Two steps to simulate a random draw from the joint posterior distribution of θ and φ: p(θ, φ|y )
◦ Draw φ from its marginal posterior distribution: p(φ|y )
◦ Draw parameter θ from its conditional posterior p(θ|φ, y )
2. If desired, draw predictive values ỹ from the posterior predictive distribution given the drawn θ
Example: Rat tumors
• Goal: Estimating the risk of tumor in a group of rats
• Data (number of rats that developed a tumor, out of the number of rats in each experiment):
1. 70 historical experiments:
0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19
0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19
1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20
2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20
3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20
4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20
6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24
2. Current experiment: 4/14
Bayesian approach to hierarchical models
• Model specification
◦ sampling distribution of data: yj∼ binomial(nj, θj), j = 1, 2, · · · , 71.
◦ the population distribution of θ: θj ∼ Beta(α, β) where α and β are the hyperparameters.
• Bayesian estimation
◦ non-informative prior for hyperparameters: p(α, β)
◦ consider the above model specification: p(θ|α, β)
◦ find the joint posterior distribution of parameter θ and hyperparameters α and β:
p(θ, α, β|y) ∝ p(α, β)p(θ|α, β)p(y|θ, α, β)
∝ p(α, β) ∏_{j=1}^{J} [Γ(α+β) / (Γ(α)Γ(β))] θj^(α−1) (1−θj)^(β−1) · ∏_{j=1}^{J} θj^(yj) (1−θj)^(nj−yj)
Analytical derivation of conditional/marginal dist.
• the joint posterior distribution:
p(θ, α, β|y) ∝ p(α, β) ∏_{j=1}^{J} [Γ(α+β) / (Γ(α)Γ(β))] θj^(α−1) (1−θj)^(β−1) · ∏_{j=1}^{J} θj^(yj) (1−θj)^(nj−yj)
• the conditional posterior density of θ given α and β:
p(θ|α, β, y) = ∏_{j=1}^{J} [Γ(α+β+nj) / (Γ(α+yj) Γ(β+nj−yj))] θj^(α+yj−1) (1−θj)^(β+nj−yj−1)
• the marginal posterior distribution of α and β:
p(α, β|y) = p(θ, α, β|y) / p(θ|α, β, y) ∝ p(α, β) ∏_{j=1}^{J} [Γ(α+β) / (Γ(α)Γ(β))] · [Γ(α+yj) Γ(β+nj−yj) / Γ(α+β+nj)]
Choice of hyperprior distribution
• Idea: set up a 'non-informative' hyperprior distribution
◦ p(logit(α/(α+β)), log(α+β)) = p(log(α/β), log(α+β)) ∝ 1
NO GOOD, because it leads to an improper posterior.
◦ p(α/(α+β), α+β) ∝ 1, or p(α, β) ∝ 1
NO GOOD, because the posterior density is not integrable in the limit.
◦ p(α/(α+β), (α+β)^(−1/2)) ∝ 1 ⇐⇒ p(α, β) ∝ (α+β)^(−5/2)
⇐⇒ p(log(α/β), log(α+β)) ∝ αβ(α+β)^(−5/2)
OK, because it leads to a proper posterior.
Computing marginal posterior of the hyperparameters
• Computing the relative (unnormalized) posterior density on a grid of values that covers the effective range of (α, β):
◦ first grid: (log(α/β), log(α+β)) ∈ [−2.5, −1] × [1.5, 3]
◦ refined grid: (log(α/β), log(α+β)) ∈ [−2.3, −1.3] × [1, 5]
• Drawing a contour plot of the marginal density of (log(α/β), log(α+β)):
◦ contour lines are at 0.05, 0.15, · · · , 0.95 times the density at the mode.
• Normalizing by approximating the posterior distribution as a step function over the grid and setting the total probability in the grid to 1.
• Computing the posterior moments from the grid of (log(α/β), log(α+β)). For example, E(α|y) is estimated by
Σ α · p(log(α/β), log(α+β) | y), summing over the grid points.
Sampling from the joint posterior
1. Simulate 1000 draws of (log(α/β), log(α+β)) from their posterior distribution using the discrete-grid sampling procedure.
2. For l = 1, · · · , 1000
◦ Transform the l -th draw of (log(αβ), log(α + β)) to the scale of (α, β) to yield a draw of the hyperparameters from their marginal posterior distribution.
◦ For each j = 1, · · · , J, sample θj from its conditional posterior distribution θj|α, β, y ∼ Beta(α + yj, β + nj− yj).
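The grid computation and the two sampling steps above can be sketched as follows. This is an illustrative sketch only: it uses a small subset of the 71 experiments (the full analysis is identical, just with more data), and the grid resolution is an arbitrary choice.

```python
# Discrete-grid sampling for the rat-tumor hierarchical model on an
# illustrative subset of the data.
import math
import random

random.seed(1)

y = [0, 0, 1, 2, 2, 4, 5, 6, 9, 4]           # tumors (last entry: current 4/14)
n = [20, 19, 20, 25, 20, 20, 22, 23, 24, 14]  # rats per experiment

def log_post(u, v):
    # Unnormalized log p(u, v | y), u = log(alpha/beta), v = log(alpha+beta);
    # the hyperprior p(alpha, beta) ∝ (alpha+beta)^(-5/2) becomes
    # ∝ alpha*beta*(alpha+beta)^(-5/2) on the (u, v) scale.
    s = math.exp(v)                       # alpha + beta
    a = s / (1.0 + math.exp(-u))          # alpha
    b = s - a                             # beta
    lp = math.log(a) + math.log(b) - 2.5 * v
    for yj, nj in zip(y, n):
        lp += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
               + math.lgamma(a + yj) + math.lgamma(b + nj - yj)
               - math.lgamma(a + b + nj))
    return lp

# Posterior on the refined grid, normalized as a step function.
us = [-2.3 + 0.02 * i for i in range(51)]
vs = [1.0 + 0.08 * i for i in range(51)]
grid = [(u, v) for u in us for v in vs]
lps = [log_post(u, v) for u, v in grid]
top = max(lps)                            # subtract max before exponentiating
w = [math.exp(lp - top) for lp in lps]
total = sum(w)
probs = [wi / total for wi in w]

# Step 1: draw (u, v) from the grid and transform to (alpha, beta);
# Step 2: draw each theta_j from Beta(alpha + y_j, beta + n_j - y_j).
draws = random.choices(grid, weights=probs, k=1000)
thetas = []
for u, v in draws:
    s = math.exp(v)
    a = s / (1.0 + math.exp(-u))
    b = s - a
    thetas.append([random.betavariate(a + yj, b + nj - yj)
                   for yj, nj in zip(y, n)])
```

The posterior draws for the current experiment (last component) are shrunk from the raw rate 4/14 toward the population mean, which is the behavior discussed on the next slide.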
Displaying the results
• Plot the posterior means and 95% intervals for the θj’s (Figure 5.4 on page 131)
• The rates θj are shrunk from their sample point estimates, yj/nj, towards the approximate mean of the population distribution.
• Experiments with few observations are shrunk more and have higher posterior variances.
• Note that posterior variability is higher in the full Bayesian analysis, reflecting posterior uncertainty in the hyperparameters.
Hierarchical normal models (I)
• Model specification
◦ Sampling distribution of data:
yij|θj ∼ Normal(θj, σ²), i = 1, · · · , nj, j = 1, 2, · · · , J, with σ² known
◦ the population distribution of θ: θj ∼ Normal(µ, τ²), where µ and τ are the hyperparameters. That is,
p(θ1, · · · , θJ|µ, τ) = ∏_{j=1}^{J} N(θj|µ, τ²)
◦ marginally, averaging over the hyperparameters,
p(θ1, · · · , θJ) = ∫ ∏_{j=1}^{J} N(θj|µ, τ²) p(µ, τ) d(µ, τ).
Hierarchical normal models (II)
• Bayesian estimation
◦ non-informative prior for hyperparameters:
p(µ, τ ) = p(µ|τ )p(τ ) ∝ p(τ )
◦ consider the above model specification: p(θ|µ, τ )
◦ find the joint posterior distribution of parameter θ and hyperparameters µ and τ :
p(θ, µ, τ|y) ∝ p(µ, τ) p(θ|µ, τ) p(y|θ)
∝ p(µ, τ) ∏_{j=1}^{J} N(θj|µ, τ²) · ∏_{j=1}^{J} N(ȳ.j|θj, σ²/nj)
Conditional posterior of θ given (µ, τ ), p(θ|µ, τ, y)
• θj|µ, τ ∼ Normal(µ, τ²)
• θj|µ, τ, y ∼ Normal(θ̂j, Vj), where
◦ θ̂j = [(nj/σ²) ȳ.j + (1/τ²) µ] / (nj/σ² + 1/τ²)
◦ Vj = 1 / (nj/σ² + 1/τ²)
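The precision-weighted form of θ̂j and Vj above can be written as a small helper (a sketch; the numeric inputs are made up for illustration):

```python
# Conditional posterior of theta_j given the hyperparameters: the posterior
# mean is a precision-weighted average of the group mean and mu.
def conditional_posterior(ybar_j, n_j, sigma2, mu, tau2):
    """Return (theta_hat_j, V_j) for theta_j | mu, tau, y."""
    prec_data = n_j / sigma2        # precision of ybar_j given theta_j
    prec_prior = 1.0 / tau2         # precision from the population distribution
    V_j = 1.0 / (prec_data + prec_prior)
    theta_hat_j = V_j * (prec_data * ybar_j + prec_prior * mu)
    return theta_hat_j, V_j

# With equal precisions (both 0.25 here), the posterior mean is the
# midpoint of ybar_j = 10 and mu = 6.
theta_hat, V = conditional_posterior(ybar_j=10.0, n_j=4, sigma2=16.0,
                                     mu=6.0, tau2=4.0)
```
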
Marginal posterior of µ and τ , p(µ, τ |y)
p(µ, τ|y) ∝ p(µ, τ) p(y|µ, τ)
Since ȳ.j|µ, τ ∼ Normal(µ, σ²/nj + τ²), therefore
p(µ, τ|y) ∝ p(µ, τ) ∏_{j=1}^{J} N(ȳ.j|µ, σ²/nj + τ²)
Posterior of µ given τ , p(µ|τ, y)
p(µ, τ|y) = p(µ|τ, y) p(τ|y)
⇒ p(µ|τ, y) = p(µ, τ|y) / p(τ|y). Therefore,
µ|τ, y ∼ Normal(µ̂, Vµ), where
µ̂ = [Σ_{j=1}^{J} ȳ.j / (σ²/nj + τ²)] / [Σ_{j=1}^{J} 1 / (σ²/nj + τ²)]
and Vµ⁻¹ = Σ_{j=1}^{J} 1 / (σ²/nj + τ²)
Posterior distribution of τ , p(τ |y)
p(τ|y) = p(µ, τ|y) / p(µ|τ, y)
∝ p(τ) ∏_{j=1}^{J} N(ȳ.j|µ, σ²/nj + τ²) / N(µ|µ̂, Vµ)
∝ p(τ) ∏_{j=1}^{J} N(ȳ.j|µ̂, σ²/nj + τ²) / N(µ̂|µ̂, Vµ)
∝ p(τ) Vµ^(1/2) ∏_{j=1}^{J} (σ²/nj + τ²)^(−1/2) exp(−(ȳ.j − µ̂)² / (2(σ²/nj + τ²)))
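The τ → µ → θ simulation chain implied by the derivations above can be sketched as follows. The group means, group sizes, and σ are made-up illustrative values, and a uniform hyperprior p(τ) ∝ 1 is assumed:

```python
# Simulate from p(tau|y), then mu | tau, y, then each theta_j | mu, tau, y,
# for a hierarchical normal model with known sigma^2.
import math
import random

random.seed(2)

ybar = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]  # illustrative group means
nj = [30] * len(ybar)                                  # group sizes
sigma2 = 15.0 ** 2                                     # known sigma^2

def mu_hat_and_V(tau2):
    # mu | tau, y ~ Normal(mu_hat, V_mu) with precision-weighted mean.
    prec = [1.0 / (sigma2 / n + tau2) for n in nj]
    V_mu = 1.0 / sum(prec)
    mu_hat = V_mu * sum(p * yb for p, yb in zip(prec, ybar))
    return mu_hat, V_mu

def log_p_tau(tau):
    # log p(tau | y) up to a constant, using the final expression above,
    # with p(tau) ∝ 1.
    tau2 = tau * tau
    mu_hat, V_mu = mu_hat_and_V(tau2)
    lp = 0.5 * math.log(V_mu)
    for yb, n in zip(ybar, nj):
        s2 = sigma2 / n + tau2
        lp += -0.5 * math.log(s2) - (yb - mu_hat) ** 2 / (2.0 * s2)
    return lp

taus = [0.01 + 0.3 * i for i in range(100)]        # grid for tau
lws = [log_p_tau(t) for t in taus]
top = max(lws)
w = [math.exp(lp - top) for lp in lws]             # unnormalized grid weights

draws = []
for _ in range(1000):
    tau = random.choices(taus, weights=w, k=1)[0]  # tau ~ p(tau|y) on the grid
    mu_hat, V_mu = mu_hat_and_V(tau * tau)
    mu = random.gauss(mu_hat, math.sqrt(V_mu))     # mu | tau, y
    theta = []
    for yb, n in zip(ybar, nj):
        prec_d, prec_p = n / sigma2, 1.0 / tau ** 2
        V_j = 1.0 / (prec_d + prec_p)
        theta.append(random.gauss(V_j * (prec_d * yb + prec_p * mu),
                                  math.sqrt(V_j)))  # theta_j | mu, tau, y
    draws.append((tau, mu, theta))
```

Each element of `draws` is one joint draw of (τ, µ, θ1, …, θJ), exactly the two-step scheme of the earlier "Simulations from the posterior distributions" slide with φ = (µ, τ).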
Prior distribution of τ, p(τ)
• The expression for p(τ|y) above holds for any hyperprior p(τ); at this highest level a non-informative choice, such as the uniform p(τ) ∝ 1, is common (recall the specification p(µ, τ) = p(µ|τ)p(τ) ∝ p(τ)).