Introduction to Bayesian Statistics Lecture 3: Single Parameter (II)

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

March 11, 2015


Conjugate Prior Distributions

Definition of Conjugacy: If F is a class of sampling distributions p(y|θ), and P is a class of prior distributions for θ, then the class P is conjugate for F if

p(θ|y ) ∈ P for all p(·|θ) ∈ F and p(θ) ∈ P.

• Advantages of using conjugate priors:

◦ computational convenience

◦ being interpretable as additional data

• Example: Beta is conjugate for binomial with θ ∼ Beta(α, β) and θ|y ∼ Beta(α + y , β + n − y ).

• Exercise: What is the conjugate prior for Poisson(λ)?
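The Beta-binomial example above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `beta_binomial_update` and `beta_mean` and the numbers (a Beta(2, 2) prior, 7 successes in 10 trials) are hypothetical.

```python
def beta_binomial_update(alpha, beta, y, n):
    """Return the posterior (alpha, beta) after y successes in n binomial trials.

    Prior: theta ~ Beta(alpha, beta). Posterior: Beta(alpha + y, beta + n - y).
    """
    if not 0 <= y <= n:
        raise ValueError("y must satisfy 0 <= y <= n")
    return alpha + y, beta + n - y


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical numbers: a Beta(2, 2) prior and 7 successes in 10 trials.
prior = (2.0, 2.0)
posterior = beta_binomial_update(*prior, y=7, n=10)

print(posterior)              # (9.0, 5.0)
print(beta_mean(*prior))      # prior mean
print(beta_mean(*posterior))  # posterior mean, between the prior mean and y/n
```

The update only adds counts to the prior parameters, which is one way to read the "additional data" interpretation: the prior behaves roughly like earlier successes and failures.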


Conjugate Prior Distributions for exponential families

Definition: The class F is an exponential family if all its members have the form

$$p(y_i|\theta) = f(y_i)\, g(\theta)\, e^{\phi(\theta)^T u(y_i)},$$

where $\phi(\theta)$ is the “natural parameter” of the family F.

Exercise: Show that binomial(n, θ) is an exponential family with natural parameter logit(θ), and that the conjugate priors on θ are Beta distributions.


Conjugate Prior Distributions for exponential families

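• Likelihood of θ:

$$p(y|\theta) = \left[\prod_{i=1}^{n} f(y_i)\right] g(\theta)^{n} \exp\left(\phi(\theta)^T \sum_{i=1}^{n} u(y_i)\right) \propto g(\theta)^{n} \exp\left(\phi(\theta)^T t(y)\right),$$

where $t(y) = \sum_{i=1}^{n} u(y_i)$ is a sufficient statistic for θ.

• (Conjugate) Prior:

$$p(\theta) \propto g(\theta)^{\eta} \exp\left(\phi(\theta)^T \nu\right)$$

• Posterior:

$$p(\theta|y) \propto g(\theta)^{\eta+n} \exp\left(\phi(\theta)^T (\nu + t(y))\right).$$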

• Known fact: Exponential families are, in general, the only classes of distributions that have natural conjugate priors.
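The prior-to-posterior step above only moves the hyperparameters: η becomes η + n and ν becomes ν + t(y). The sketch below checks this numerically for one assumed example, the geometric model p(y|θ) = θ(1 − θ)^y with g(θ) = θ, φ(θ) = log(1 − θ) and u(y) = y. The model choice, the data, and all function names are illustrative assumptions, not taken from the slides.

```python
import math


def conjugate_update(eta, nu, u_values):
    """Map prior hyperparameters (eta, nu) to posterior (eta + n, nu + t(y))."""
    n = len(u_values)
    t = sum(u_values)  # sufficient statistic t(y) = sum_i u(y_i)
    return eta + n, nu + t


def log_kernel(theta, eta, nu):
    """log of g(theta)^eta * exp(phi(theta) * nu) for the geometric model."""
    return eta * math.log(theta) + nu * math.log(1.0 - theta)


def log_likelihood(theta, ys):
    """log of prod_i theta * (1 - theta)^y_i (here f(y_i) = 1)."""
    return sum(math.log(theta) + yi * math.log(1.0 - theta) for yi in ys)


# Hypothetical data and prior hyperparameters.
y = [3, 0, 5, 2]
eta0, nu0 = 1.0, 2.0
eta1, nu1 = conjugate_update(eta0, nu0, y)
print(eta1, nu1)  # 5.0 12.0

# The posterior kernel equals prior kernel times likelihood at every theta.
for theta in (0.1, 0.5, 0.9):
    lhs = log_kernel(theta, eta0, nu0) + log_likelihood(theta, y)
    rhs = log_kernel(theta, eta1, nu1)
    print(theta, abs(lhs - rhs) < 1e-9)
```

In this example the kernel θ^η (1 − θ)^ν is that of a Beta(η + 1, ν + 1) distribution, so the conjugate family is again the Beta family.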


Single Parameter θ: Continuous y

• y ∼ normal(θ, σ²), σ² known, use Bayesian approach to estimate θ.

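◦ choose a conjugate prior for θ, $p(\theta) = e^{A\theta^2 + B\theta + C}$, such that

$$p(\theta) \propto \exp\left(-\frac{1}{2\tau_0^2}(\theta - \mu_0)^2\right)$$

◦ likelihood of θ:

$$p(y|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(y - \theta)^2\right)$$

◦ find the posterior distribution of θ:

$$p(\theta|y) \propto p(\theta)\,p(y|\theta) \propto \exp\left(-\frac{1}{2}\left[\frac{(y - \theta)^2}{\sigma^2} + \frac{(\theta - \mu_0)^2}{\tau_0^2}\right]\right) \propto \exp\left(-\frac{1}{2\tau_1^2}(\theta - \mu_1)^2\right),$$

that is, $\theta|y \sim \text{normal}(\mu_1, \tau_1^2)$, where

$$\mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma^2}y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}} \quad \text{and} \quad \frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}.$$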


Single Parameter θ: Continuous y

• θ ∼ normal(µ0, τ₀²), y ∼ normal(θ, σ²) ⇒ θ|y ∼ normal(µ1, τ₁²)

• posterior precision $1/\tau_1^2$

◦ Definition of precision: the inverse of variance

◦ $\frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}$, i.e., the posterior precision equals the prior precision plus the data precision.

• posterior mean $\mu_1$

◦ $\mu_1 = \dfrac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma^2}y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}}$, i.e., the posterior mean is a weighted average of the prior mean and the observed value y, with weights proportional to the precisions.

◦ the prior mean adjusted toward the observed y: $\mu_1 = \mu_0 + (y - \mu_0)\dfrac{\tau_0^2}{\sigma^2 + \tau_0^2}$.

◦ a compromise between the prior mean and the observed data y, with data shrunk toward the prior mean: $\mu_1 = y - (y - \mu_0)\dfrac{\sigma^2}{\sigma^2 + \tau_0^2}$.
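The three expressions for the posterior mean (precision-weighted average, prior mean adjusted toward y, and y shrunk toward the prior mean) are algebraically identical. A minimal numerical check follows; the function name `normal_update` and the input values are hypothetical.

```python
import math


def normal_update(mu0, tau0_sq, y, sigma_sq):
    """Posterior (mean, variance) of theta for one observation y ~ normal(theta, sigma_sq)
    with prior theta ~ normal(mu0, tau0_sq) and sigma_sq known."""
    prior_precision = 1.0 / tau0_sq
    data_precision = 1.0 / sigma_sq
    post_precision = prior_precision + data_precision  # precisions add
    mu1 = (prior_precision * mu0 + data_precision * y) / post_precision
    return mu1, 1.0 / post_precision


# Hypothetical inputs: prior normal(0, 4), one observation y = 3, sigma^2 = 1.
mu0, tau0_sq, y, sigma_sq = 0.0, 4.0, 3.0, 1.0
mu1, tau1_sq = normal_update(mu0, tau0_sq, y, sigma_sq)

# The two alternative forms of the posterior mean.
adjusted = mu0 + (y - mu0) * tau0_sq / (sigma_sq + tau0_sq)
shrunk = y - (y - mu0) * sigma_sq / (sigma_sq + tau0_sq)

print(mu1, adjusted, shrunk)  # all three agree up to rounding
print(tau1_sq)                # smaller than both tau0_sq and sigma_sq
print(math.isclose(mu1, adjusted), math.isclose(mu1, shrunk))
```

Because the prior variance here is four times the data variance, the posterior mean sits four fifths of the way from the prior mean to the observation.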


Single Parameter θ: Continuous y

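Posterior predictive distribution $p(\tilde{y}|y)$:

$$p(\tilde{y}|y) = \int p(\tilde{y}|\theta)\,p(\theta|y)\,d\theta \propto \int \exp\left(-\frac{1}{2\sigma^2}(\tilde{y} - \theta)^2\right) \exp\left(-\frac{1}{2\tau_1^2}(\theta - \mu_1)^2\right) d\theta$$

• $\tilde{y}|y \sim \text{normal}(?, ?)$

• $E(\tilde{y}|y) = E(E(\tilde{y}|\theta, y)|y) = E(\theta|y) = \mu_1$

• $\text{var}(\tilde{y}|y) = E(\text{var}(\tilde{y}|\theta, y)|y) + \text{var}(E(\tilde{y}|\theta, y)|y) = E(\sigma^2|y) + \text{var}(\theta|y) = \sigma^2 + \tau_1^2$.

Note. $E(\tilde{y}|\theta) = \theta$, $\text{var}(\tilde{y}|\theta) = \sigma^2$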
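The mean and variance of the posterior predictive distribution can be checked by simulation: draw θ from its posterior, then draw a new observation given that θ. The sketch below uses only the Python standard library; the function name and the values of µ₁, τ₁² and σ² are hypothetical.

```python
import random
import statistics


def simulate_posterior_predictive(mu1, tau1_sq, sigma_sq, draws, seed=0):
    """Draw from p(y_new | y) by composition: theta ~ p(theta | y), then y_new ~ p(y_new | theta)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        theta = rng.gauss(mu1, tau1_sq ** 0.5)             # theta | y ~ normal(mu1, tau1_sq)
        samples.append(rng.gauss(theta, sigma_sq ** 0.5))  # y_new | theta ~ normal(theta, sigma_sq)
    return samples


# Hypothetical posterior and sampling variance.
mu1, tau1_sq, sigma_sq = 2.4, 0.8, 1.0
samples = simulate_posterior_predictive(mu1, tau1_sq, sigma_sq, draws=100_000)

sim_mean = statistics.fmean(samples)
sim_var = statistics.variance(samples)
print(sim_mean, "vs theoretical mean", mu1)
print(sim_var, "vs theoretical variance", sigma_sq + tau1_sq)
```

The simulated variance exceeds σ² because it combines sampling variability with the remaining posterior uncertainty about θ.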


Single Parameter θ: Continuous $y = (y_1, \dots, y_n)$

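• $y_1, \dots, y_n \overset{iid}{\sim} \text{normal}(\theta, \sigma^2)$, σ² known, use Bayesian approach to estimate θ.

◦ choose a conjugate prior for θ:

$$p(\theta) \propto \exp\left(-\frac{1}{2\tau_0^2}(\theta - \mu_0)^2\right)$$

◦ likelihood of θ:

$$p(y|\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(y_i - \theta)^2\right)$$

◦ find the posterior distribution of θ:

$$p(\theta|y) \propto p(\theta)\,p(y|\theta) \propto \exp\left(-\frac{1}{2}\left[\frac{\sum_{i=1}^{n}(y_i - \theta)^2}{\sigma^2} + \frac{(\theta - \mu_0)^2}{\tau_0^2}\right]\right) \propto \exp\left(-\frac{1}{2\tau_n^2}(\theta - \mu_n)^2\right),$$

that is, $\theta|y \sim \text{normal}(\mu_n, \tau_n^2)$, where

$$\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \quad \text{and} \quad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}.$$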


Single Parameter θ: Continuous $y = (y_1, \dots, y_n)$

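• $y_1, \dots, y_n \overset{iid}{\sim} \text{normal}(\theta, \sigma^2)$, σ² known, $\theta \sim \text{normal}(\mu_0, \tau_0^2)$ ⇒ $\theta|y \sim \text{normal}(\mu_n, \tau_n^2)$

• posterior precision $\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$; posterior mean $\mu_n = \dfrac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}$

◦ If n is large, the posterior distribution is largely determined by σ² and the sample mean $\bar{y}$.

◦ As $\tau_0 \to \infty$ with n fixed, or as $n \to \infty$ with $\tau_0^2$ fixed, we have

$$p(\theta|y) \approx \text{normal}(\theta \,|\, \bar{y}, \sigma^2/n).$$

◦ Compare the well-known result of classical statistics: $\bar{y}|\theta, \sigma^2 \sim \text{normal}(\theta, \sigma^2/n)$ leads to the use of $\bar{y} \pm 1.96\,\sigma/\sqrt{n}$ as a 95% confidence interval for θ.

◦ The Bayesian approach gives the same interval in the limit of a noninformative prior ($\tau_0 \to \infty$).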
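The limiting behaviour described above can be seen numerically: as the prior variance grows, the central 95% posterior interval approaches the classical interval around the sample mean. The sketch below is illustrative only; the function names and the data summary (sample mean 5, n = 25, σ² = 4, prior mean 0) are hypothetical and unrelated to the exercise that follows.

```python
import math


def normal_posterior(mu0, tau0_sq, ybar, n, sigma_sq):
    """Posterior (mean, variance) of theta given n iid normal(theta, sigma_sq) observations
    with sample mean ybar and prior theta ~ normal(mu0, tau0_sq)."""
    post_precision = 1.0 / tau0_sq + n / sigma_sq
    mu_n = (mu0 / tau0_sq + n * ybar / sigma_sq) / post_precision
    return mu_n, 1.0 / post_precision


def central_interval(mean, var, z=1.96):
    """Central interval mean +/- z * sd of a normal distribution (z = 1.96 for 95%)."""
    half_width = z * math.sqrt(var)
    return mean - half_width, mean + half_width


# Hypothetical data summary and prior mean.
ybar, n, sigma_sq, mu0 = 5.0, 25, 4.0, 0.0

classical = central_interval(ybar, sigma_sq / n)
print("classical 95% CI:", classical)

# Increasingly diffuse priors: the posterior interval approaches the classical one.
for tau0_sq in (1.0, 100.0, 1e6):
    mu_n, tau_n_sq = normal_posterior(mu0, tau0_sq, ybar, n, sigma_sq)
    print("tau0^2 =", tau0_sq, "posterior 95% interval:", central_interval(mu_n, tau_n_sq))

diffuse = central_interval(*normal_posterior(mu0, 1e6, ybar, n, sigma_sq))
```

The two intervals coincide numerically in the limit but are interpreted differently: one is a probability statement about θ given the data, the other a statement about the procedure under repeated sampling.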


Exercise

A random sample of n students is drawn from a large population, and their weights are measured. The average weight of the n sampled students is $\bar{y} = 150$ pounds. Assume the weights in the population are normally distributed with unknown mean θ and known standard deviation 20 pounds. Suppose your prior distribution for θ is normal with mean 180 and standard deviation 40.

(a) Give your posterior distribution for θ.

(b) A new student is sampled at random from the same population and has a weight of $\tilde{y}$ pounds. Give a posterior predictive distribution for $\tilde{y}$.

(c) For n = 10, give a 95% posterior interval for θ and a 95% posterior predictive interval for $\tilde{y}$.

(d) Do the same for n = 100.


Single Parameter σ²: Continuous $y = (y_1, \dots, y_n)$

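• $y_1, \dots, y_n \overset{iid}{\sim} \text{normal}(\theta, \sigma^2)$, θ known, use Bayesian approach to estimate σ².

◦ likelihood of σ²:

$$p(y|\sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(y_i - \theta)^2\right) \propto \sigma^{-n} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \theta)^2\right) = (\sigma^2)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}v\right),$$

where $v = \frac{1}{n}\sum_{i=1}^{n}(y_i - \theta)^2$

◦ choose a conjugate prior for σ² (inverse-gamma):

$$p(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)} e^{-\beta/\sigma^2}$$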


Single Parameter σ²: Continuous $y = (y_1, \dots, y_n)$

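• $y_1, \dots, y_n \overset{iid}{\sim} \text{normal}(\theta, \sigma^2)$, θ known, estimate σ².

◦ likelihood of σ²: $p(y|\sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}v\right)$

◦ choose a conjugate prior for σ² (inverse-gamma):

$p(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)} e^{-\beta/\sigma^2}$, i.e., $\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$ with $\alpha = \nu_0/2$ and $\beta = \nu_0\sigma_0^2/2$

Note. A scaled inverse-χ² distribution with scale $\sigma_0^2$ and $\nu_0$ degrees of freedom: $\frac{\sigma_0^2 \nu_0}{X} \sim \chi^2_{\nu_0}$, i.e., $X \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$

◦ find the posterior distribution of σ²:

$$p(\sigma^2|y) \propto p(\sigma^2)\,p(y|\sigma^2) \propto \left(\frac{\sigma_0^2}{\sigma^2}\right)^{\nu_0/2+1} \exp\left(-\frac{\sigma_0^2 \nu_0}{2\sigma^2}\right) \cdot (\sigma^2)^{-n/2} \exp\left(-\frac{n}{2}\frac{v}{\sigma^2}\right) \propto (\sigma^2)^{-((n+\nu_0)/2+1)} \exp\left(-\frac{1}{2\sigma^2}(\nu_0\sigma_0^2 + nv)\right),$$

that is, $\sigma^2|y \sim \text{Inv-}\chi^2\left(\nu_0 + n, \dfrac{\nu_0\sigma_0^2 + nv}{\nu_0 + n}\right)$.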
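The scaled inverse-χ² update above amounts to adding degrees of freedom and averaging the prior scale with v. The sketch below performs the update and draws from the posterior using the representation in the Note (ν s² divided by a χ² draw with ν degrees of freedom). The data, prior values, and function names are hypothetical, and the χ² draw relies on the standard fact that a χ² variable with ν degrees of freedom is a gamma variable with shape ν/2 and scale 2.

```python
import random
import statistics


def inv_chi2_update(nu0, s0_sq, y, theta):
    """Posterior (degrees of freedom, scale) for sigma^2 with theta known
    and prior sigma^2 ~ Inv-chi^2(nu0, s0_sq)."""
    n = len(y)
    v = sum((yi - theta) ** 2 for yi in y) / n  # v = (1/n) * sum_i (y_i - theta)^2
    nu_n = nu0 + n
    s_n_sq = (nu0 * s0_sq + n * v) / nu_n
    return nu_n, s_n_sq


def draw_scaled_inv_chi2(nu, s_sq, rng):
    """One draw of X ~ Inv-chi^2(nu, s_sq), using nu * s_sq / X ~ chi^2 with nu df."""
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # gamma(shape nu/2, scale 2) is chi^2 with nu df
    return nu * s_sq / chi2


# Hypothetical data with known mean theta = 0 and an Inv-chi^2(4, 1) prior.
y = [1.2, -0.7, 0.3, 2.1, -1.5, 0.4]
theta = 0.0
nu_n, s_n_sq = inv_chi2_update(nu0=4.0, s0_sq=1.0, y=y, theta=theta)
print(nu_n, s_n_sq)

rng = random.Random(1)
draws = [draw_scaled_inv_chi2(nu_n, s_n_sq, rng) for _ in range(200_000)]
sim_mean = statistics.fmean(draws)
exact_mean = nu_n * s_n_sq / (nu_n - 2.0)  # mean of Inv-chi^2(nu, s^2), valid for nu > 2
print(sim_mean, "vs exact posterior mean", exact_mean)
```

Sorting the draws and reading off the 2.5% and 97.5% points would give a simulation-based 95% posterior interval for σ².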


Homework II

1. The following table gives the number of fatal accidents and deaths on scheduled airline flights per year over a ten-year period.

Year   Fatal accidents   Passenger deaths   Death rate
1976   24                734                0.19
1977   25                516                0.12
1978   31                754                0.15
1979   31                877                0.16
1980   22                814                0.14
1981   21                362                0.06
1982   26                764                0.13
1983   20                809                0.13
1984   16                223                0.03
1985   22                1066               0.15

(a) Assume that the numbers of fatal accidents in each year are independent with a Poisson(θ) distribution. Set a prior distribution for θ and determine the posterior distribution based on the data from 1976 through 1985. Under this model, give a 95% predictive interval for the number of fatal accidents in 1986. You can use the normal approximation to the gamma and Poisson distributions or compute the interval by simulation.

(b) Repeat (a) above, replacing ‘fatal accidents’ with ‘passenger deaths’.


Homework II

2. Censored and uncensored data in the exponential model:

(a) Suppose y |θ is exponentially distributed with rate θ, and the marginal (prior) distribution of θ is Gamma(α, β). Suppose we observe that y ≥ 100, but do not observe the exact value of y . What is the posterior distribution, p(θ|y ≥ 100), as a function of α and β? Write down the posterior mean and variance of θ.

(b) In the above problem, suppose that we are now told that y is exactly 100. Now what are the posterior mean and variance of θ?
