Introduction to Bayesian Statistics Lecture 11: Model Comparison

(1)

Rung-Ching Tsai

Department of Mathematics National Taiwan Normal University

May 20, 2015

(2)

Evaluating and Comparing Models

• Measure of predictive accuracy

◦ log predictive density as a measure of fit

◦ Out-of-sample predictive accuracy as a gold standard

• deviance, information criteria and cross-validation

◦ Within-sample predictive accuracy

◦ Subtracting an adjustment

◦ Cross-validation

• Model comparison based on predictive performance

• Model comparison based on Bayes factor

(3)

Akaike information criterion (AIC)

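• $\widehat{\mathrm{elpd}}_{\mathrm{AIC}} = \log p(y \mid \hat\theta_{\mathrm{mle}}) - k$, where $k$ is the number of estimated parameters

• elpd = expected log predictive density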

• Based on fit to observed data given maximum likelihood estimates

• Goal: estimate the expected log predictive density (elpd), $\mathrm{elpd} = E_{\tilde{y}}[\log p(\tilde{y} \mid \hat\theta_{\mathrm{mle}})]$

◦ expectation averages over the predictive distribution of ˜y

◦ AIC originated with Akaike's (1973) theorem, which established that, under regularity conditions, AIC is an asymptotically unbiased estimator of predictive accuracy.
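The AIC adjustment can be illustrated with a minimal sketch. It assumes an i.i.d. normal model with unknown mean and variance (so $k = 2$) and simulated data; the function name `elpd_aic_normal` and the data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def elpd_aic_normal(y):
    """Return (log p(y | theta_mle), k, elpd_AIC) for an i.i.d. normal model."""
    y = np.asarray(y, dtype=float)
    n = y.size
    sigma2_hat = y.var()  # MLE of the variance (divides by n, not n - 1)
    # At the MLE the quadratic term of the normal log-likelihood equals n / 2.
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1.0)
    k = 2  # two estimated parameters: mean and variance
    return loglik, k, loglik - k

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=50)  # simulated data (assumption)
loglik, k, elpd_aic = elpd_aic_normal(y)
aic = -2 * elpd_aic  # the usual deviance-scale AIC
print(loglik, elpd_aic, aic)
```

On the deviance scale this gives the familiar form AIC $= -2\log p(y \mid \hat\theta_{\mathrm{mle}}) + 2k$.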

(4)

deviance

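What is the ‘deviance’?

• For a likelihood $p(y \mid \theta)$, we define the deviance as $D(y, \theta) = -2\log p(y \mid \theta)$

e.g. for $Y_1, Y_2, \dots, Y_n$ with $Y_i \sim \mathrm{Binomial}(n_i, \theta_i)$, the deviance is

$$-2 \sum_i \left[ y_i \log\theta_i + (n_i - y_i)\log(1 - \theta_i) + \log\binom{n_i}{y_i} \right]$$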

• It is possible to have a negative deviance. Deviance is derived from the likelihood and evaluated at a particular point in parameter space. For continuous data the likelihood is a density, which can exceed 1; likelihood values greater than 1 give a negative deviance, and this is perfectly valid.
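The two points on this slide can be checked numerically. The following is a minimal sketch: `binomial_deviance` and `normal_deviance` are hypothetical helper names, and the counts and parameter values are made up purely for illustration.

```python
import numpy as np
from math import lgamma, log, pi

def binomial_deviance(y, n, theta):
    """D(y, theta) = -2 log p(y | theta) for independent y_i ~ Binomial(n_i, theta_i)."""
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # log of the binomial coefficient, computed with log-gamma for stability
    log_choose = np.array([lgamma(ni + 1) - lgamma(yi + 1) - lgamma(ni - yi + 1)
                           for yi, ni in zip(y, n)])
    log_lik = np.sum(y * np.log(theta) + (n - y) * np.log1p(-theta) + log_choose)
    return -2.0 * log_lik

def normal_deviance(y, mu, sd):
    """Deviance of a single observation y ~ N(mu, sd^2)."""
    log_density = -0.5 * log(2 * pi * sd ** 2) - (y - mu) ** 2 / (2 * sd ** 2)
    return -2.0 * log_density

# Illustrative values only (assumptions, not data from the lecture).
d_binom = binomial_deviance([3, 5, 2], [10, 10, 10], [0.3, 0.5, 0.2])
d_narrow = normal_deviance(0.0, 0.0, 0.1)  # density at the mode exceeds 1
print(d_binom, d_narrow)
```

The binomial deviance is positive because each probability is below 1, while the narrow normal density exceeds 1 at its mode and so yields a negative deviance.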

(5)

mean deviance as measure of fit

• Dempster (1974) suggested plotting the posterior distribution of the deviance $D = -2\log p(y \mid \theta)$

• Use of the posterior mean deviance $\bar{D} = E[D]$ as a measure of fit

• Invariant to parameterization of θ

• Robust, generally converges well

• But more complex models will fit the data better and so will have smaller $\bar{D}$

• Need to have some measure of model complexity to trade off against $\bar{D}$

(6)

Counting parameters and model complexity: $p_D^{(1)}$

• Bayesian measure of model complexity (Spiegelhalter et al., 2002):

$$E_{\theta \mid y}[-2\log p(y \mid \theta)] - (-2\log p(y \mid \tilde\theta)) = E_{\theta \mid y}[D(y, \theta)] - D(y, \tilde\theta),$$

where $\tilde\theta = E[\theta \mid y]$. The measure is thus the posterior mean deviance minus the deviance at the posterior mean.

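• The effective number of parameters of a Bayesian model, estimated from posterior draws $\theta^l$, $l = 1, \dots, L$:

$$p_D^{(1)} = \hat{E}_{\theta \mid y}[D(y, \theta)] - D(y, \tilde\theta) = \hat{D}_{\mathrm{avg}}(y) - D_{\hat\theta}(y) = \frac{1}{L} \sum_{l=1}^{L} \left( D(y, \theta^l) - D_{\hat\theta}(y) \right).$$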
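As a rough check of the estimator, here is a sketch for a deliberately simple case: $y_i \sim N(\theta, 1)$ with a flat prior on $\theta$, so the posterior is $N(\bar{y}, 1/n)$ and there is one unconstrained parameter. The model, the simulated data, and the helper name `deviance` are assumptions chosen for illustration.

```python
import numpy as np

def deviance(y, theta):
    """D(y, theta) = -2 log p(y | theta) for y_i ~ N(theta, 1)."""
    y = np.asarray(y, dtype=float)
    return y.size * np.log(2 * np.pi) + np.sum((y - theta) ** 2)

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=30)  # simulated data (assumption)

# Flat prior => posterior theta | y ~ N(ybar, 1/n); draw L posterior samples.
L = 20000
draws = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=L)

dev_draws = np.array([deviance(y, t) for t in draws])
d_avg = dev_draws.mean()               # estimate of the posterior mean deviance
d_hat = deviance(y, draws.mean())      # deviance at the posterior mean
p_d1 = d_avg - d_hat
print(p_d1)  # should be close to 1, up to Monte Carlo error
```

In this model $p_D^{(1)}$ equals $n$ times the posterior variance of $\theta$, which is why the estimate lands near 1.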

(7)

Counting parameters and model complexity: $p_D^{(2)}$

• A related measure of model complexity is half the posterior variance of the model-level deviance; its estimate is known as $p_D^{(2)}$ (Gelman et al., 2004):

$$p_D^{(2)} = \frac{\widehat{\mathrm{var}}_{\theta \mid y}[D(y, \theta)]}{2} = \frac{1}{2} \cdot \frac{1}{L-1} \sum_{l=1}^{L} \left( D(y, \theta^l) - \hat{D}_{\mathrm{avg}}(y) \right)^2$$
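The variance-based estimate can be sketched in the same simple setting: $y_i \sim N(\theta, 1)$ with a flat prior, so that the deviance is a constant plus a $\chi^2_1$ variable and half its variance is 1. The model, simulated data, and helper name `deviance` are illustrative assumptions.

```python
import numpy as np

def deviance(y, theta):
    """D(y, theta) = -2 log p(y | theta) for y_i ~ N(theta, 1)."""
    y = np.asarray(y, dtype=float)
    return y.size * np.log(2 * np.pi) + np.sum((y - theta) ** 2)

rng = np.random.default_rng(5)
y = rng.normal(0.5, 1.0, size=30)  # simulated data (assumption)

# Posterior draws under a flat prior: theta | y ~ N(ybar, 1/n).
L = 20000
draws = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=L)
dev_draws = np.array([deviance(y, t) for t in draws])

# Half the sample variance of the deviance; ddof=1 gives the 1/(L-1) divisor.
p_d2 = np.var(dev_draws, ddof=1) / 2.0
print(p_d2)  # should be close to 1, up to Monte Carlo error
```

Unlike $p_D^{(1)}$, this quantity needs no point estimate of $\theta$, only the deviance evaluated at each draw.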

(8)

Comparison of $p_D^{(1)}$ and $p_D^{(2)}$

• $p_D^{(1)}$ is not invariant to reparameterization (the subject of much criticism).

• In normal linear hierarchical models, $p_D^{(1)} = \mathrm{tr}(H)$, where $Hy = \hat{y}$; that is, $H$ is the hat matrix that projects the data onto the fitted values. Thus $p_D^{(1)} = \sum_i h_{ii}$, the sum of the leverages. In general, the justification depends on asymptotic normality of the posterior distribution.

• $p_D^{(1)}$ or $p_D^{(2)}$ can be thought of as the number of ‘unconstrained’ parameters in the model, where a parameter counts as: 1 if it is estimated with no constraints or prior information; 0 if it is fully constrained or if all the information about the parameter comes from the prior distribution; or an intermediate value if both the data and the prior are informative.

• $p_D^{(1)}$ and $p_D^{(2)}$ should be positive. A negative $p_D^{(1)}$ indicates one or more problems: the log-likelihood is non-concave, the prior and the data conflict, or the posterior mean is a poor estimator (as with a bimodal posterior).
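The trace-of-hat-matrix reading can be illustrated with a small sketch. It assumes a normal linear model with unit data variance and independent $N(0, 1/\lambda)$ priors on the coefficients, so that the posterior mean of the fitted values is $X(X^\top X + \lambda I)^{-1} X^\top y$; the design matrix, the value of $\lambda$, and the helper name `hat_matrix` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
# Simulated design matrix with an intercept column (assumption).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

def hat_matrix(X, prior_precision=0.0):
    """H with H @ y = fitted values; prior_precision=0 is the flat-prior case."""
    k = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + prior_precision * np.eye(k), X.T)

H_flat = hat_matrix(X)
H_shrunk = hat_matrix(X, prior_precision=5.0)

trace_flat = np.trace(H_flat)      # sum of leverages, flat prior
trace_shrunk = np.trace(H_shrunk)  # sum of leverages, informative prior
print(trace_flat, trace_shrunk)
```

With a flat prior the trace equals the number of coefficients; with an informative prior each coefficient counts for less than 1, matching the "intermediate value" interpretation above.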

(9)

Deviance information criterion (DIC)

• use criterion based on trade-off between the fit of the data to the model and the corresponding complexity of the model

• Spiegelhalter et al (2002) proposed a Bayesian model comparison criterion based on this principle:

Deviance Information Criterion, DIC = goodness of fit + complexity

• $\widehat{\mathrm{elpd}}_{\mathrm{DIC}} = \log p(y \mid \hat\theta_{\mathrm{Bayes}}) - p_{\mathrm{DIC}}$

• Based on fit to observed data given posterior mean

• The effective number of parameters $p_{\mathrm{DIC}}$ is computed from a normal approximation ($\chi^2$ approximation to $-2$ log-likelihood): either $p_D^{(1)}$ or $p_D^{(2)}$

• Either $p_D^{(1)}$ or $p_D^{(2)}$ is asymptotically correct in expectation

(10)

Model comparison using DIC

• The DIC is then defined analogously to AIC as

$$\mathrm{DIC} = D(\hat\theta_{\mathrm{Bayes}}) + 2p_D^{(1)} = \bar{D} + p_D^{(1)} \quad \text{or} \quad \mathrm{DIC} = \bar{D} + p_D^{(2)}$$

• DIC may be compared across different models and even different methods, as long as the dependent variable does not change between models, which makes DIC a very flexible model-fit statistic.

• Like AIC and BIC, DIC is an asymptotic approximation as the sample size becomes large. DIC is valid only when the joint posterior distribution is approximately multivariate normal.

• Models with smaller DIC should be preferred. Since DIC increases with model complexity ($p_D^{(1)}$ or $p_D^{(2)}$), the simpler of two models that fit equally well is preferred.
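The two forms of DIC can be computed from posterior draws as in the following sketch. It assumes the toy model $y_i \sim N(\theta, 1)$ with a flat prior and simulated data; `normal_deviance` and `dic_summary` are hypothetical helper names.

```python
import numpy as np

def normal_deviance(y, theta):
    """D(y, theta) = -2 log p(y | theta) for y_i ~ N(theta, 1)."""
    return y.size * np.log(2 * np.pi) + np.sum((y - theta) ** 2)

def dic_summary(dev_draws, dev_at_post_mean):
    """Collect D_bar, both complexity estimates, and both DIC variants."""
    dev_draws = np.asarray(dev_draws, dtype=float)
    d_bar = dev_draws.mean()
    p_d1 = d_bar - dev_at_post_mean         # mean deviance minus deviance at mean
    p_d2 = dev_draws.var(ddof=1) / 2.0      # half the variance of the deviance
    return {"D_bar": d_bar, "p_D1": p_d1, "p_D2": p_d2,
            "DIC_1": d_bar + p_d1, "DIC_2": d_bar + p_d2}

rng = np.random.default_rng(4)
y = rng.normal(0.5, 1.0, size=30)  # simulated data (assumption)
draws = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=5000)
dev_draws = np.array([normal_deviance(y, t) for t in draws])
dev_at_mean = normal_deviance(y, draws.mean())

out = dic_summary(dev_draws, dev_at_mean)
print(out)
```

Because $p_D^{(1)} = \bar{D} - D(\hat\theta_{\mathrm{Bayes}})$, the value `DIC_1` agrees with $D(\hat\theta_{\mathrm{Bayes}}) + 2p_D^{(1)}$ by construction.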

(11)

How do I compare different DICs?

• The model with the smallest DIC is estimated to be the one that would make the best short-term predictions, in the same spirit as Akaike’s criterion.

• It is difficult to say what would constitute an important difference in DIC. Very roughly,

◦ differences of more than 10 might definitely rule out the model with the higher DIC.

◦ differences between 5 and 10 are substantial

◦ if the difference in DIC is, say, less than 5, and the models make very different inferences, then it could be misleading just to report the model with the lowest DIC.

(12)

Watanabe-Akaike information criterion (WAIC)

• $\widehat{\mathrm{elppd}}_{\mathrm{WAIC}} = \left( \sum_{i=1}^{n} \log p_{\mathrm{post}}(y_i) \right) - p_{\mathrm{WAIC}}$

• elppd = expected log posterior predictive density

• Based on posterior predictive fit to observed data

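• $p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{var}_{\mathrm{post}}(\log p(y_i \mid \theta))$

• Compute $p_{\mathrm{post}}$ and $\mathrm{var}_{\mathrm{post}}$ using simulations

• Requires partitioning the data into the $n$ pieces $y_1, \dots, y_n$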

• Connection to leave-one-out cross-validation
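The simulation-based computation can be sketched as follows. The input is assumed to be a matrix of pointwise log-likelihoods with one row per posterior draw and one column per observation; the function name `waic`, the toy model $y_i \sim N(\theta, 1)$ with a flat prior, and the simulated data are illustrative assumptions.

```python
import numpy as np

def waic(log_lik):
    """WAIC pieces from an (S draws) x (n observations) matrix of log p(y_i | theta^s)."""
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    # log p_post(y_i): log of the draw-averaged density, via a stable log-sum-exp.
    col_max = log_lik.max(axis=0)
    lppd_i = col_max + np.log(np.exp(log_lik - col_max).sum(axis=0)) - np.log(n_draws)
    # var_post(log p(y_i | theta)): sample variance over draws, per observation.
    p_waic_i = log_lik.var(axis=0, ddof=1)
    lppd = lppd_i.sum()
    p_waic = p_waic_i.sum()
    return lppd, p_waic, lppd - p_waic

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=40)  # simulated data (assumption)
theta = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=4000)  # posterior draws
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - theta[:, None]) ** 2

lppd, p_waic, elppd_waic = waic(log_lik)
print(lppd, p_waic, elppd_waic)
```

Working with the pointwise matrix is what ties WAIC to the data partition: each column corresponds to one of the $n$ pieces $y_i$.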

(13)

Model comparison: Bayes factor

Comparing two or more models:

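$$\frac{p(H_2 \mid y)}{p(H_1 \mid y)} = \frac{p(H_2)}{p(H_1)} \times \frac{p(y \mid H_2)}{p(y \mid H_1)}$$

• $p(H_2)/p(H_1)$ is the “prior odds”

• $B[H_2 : H_1] = p(y \mid H_2)/p(y \mid H_1)$ is the “Bayes factor”, with

$$p(y \mid H) = \int p(y \mid \theta, H)\, p(\theta \mid H)\, d\theta.$$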

• Problem with p(y |H)

◦ Integral depends on irrelevant tail properties of the prior density

◦ Consider $\bar{y} \sim N(\theta, \sigma^2/n)$ and $\theta \sim U(-A, A)$, for some large $A$

◦ The marginal $p(y)$ is proportional to $1/A$
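The dependence of the marginal density on the prior width can be verified directly, since the integral has a closed form in terms of the normal CDF. This is a minimal sketch; the function names and the numerical values of $\bar{y}$, $\sigma$, and $n$ are assumptions for illustration.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def marginal_density(ybar, sigma, n, A):
    """p(ybar) = integral over theta in (-A, A) of N(ybar | theta, sigma^2/n) / (2A)."""
    s = sigma / sqrt(n)
    # Integrating the normal density over theta gives a difference of CDFs.
    mass = normal_cdf((A - ybar) / s) - normal_cdf((-A - ybar) / s)
    return mass / (2.0 * A)

# Illustrative values (assumptions): ybar = 1.2, sigma = 2, n = 25.
m_100 = marginal_density(1.2, 2.0, 25, A=100.0)
m_1000 = marginal_density(1.2, 2.0, 25, A=1000.0)
ratio = m_100 / m_1000
print(m_100, m_1000, ratio)
```

Once $A$ is large enough to cover essentially all of the likelihood, widening the prior tenfold cuts the marginal density tenfold, even though the posterior for $\theta$ is unchanged.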

(14)

An example where the Bayes factor is good

• Genetics example with

H₁: the woman is affected, θ = 1

H₂: the woman is unaffected, θ = 0

◦ prior odds are p(H2)/p(H1) = 1

◦ Bayes factor of the data is p(y |H2)/p(y |H1) = 1.0/0.25 = 4

◦ the posterior odds are thus p(H2|y )/p(H1|y ) = 4

• The two features that allow Bayes factors to be helpful.

◦ each of the discrete alternatives makes scientific sense, and there are no obvious scientific models in between; i.e., truly discrete parameter space

◦ Model of probabilities; no unbounded parameters
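The arithmetic of the genetics example can be written out as a short sketch using the numbers on the slide; the helper names `posterior_odds` and `odds_to_probability` are assumptions.

```python
def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds = prior odds x Bayes factor."""
    return prior_odds * bayes_factor

def odds_to_probability(odds):
    """Convert odds for H2 against H1 into the probability of H2."""
    return odds / (1.0 + odds)

prior_odds = 1.0                 # p(H2) / p(H1), from the slide
bayes_factor = 1.0 / 0.25        # p(y | H2) / p(y | H1), from the slide
post_odds = posterior_odds(prior_odds, bayes_factor)
prob_h2 = odds_to_probability(post_odds)
print(post_odds, prob_h2)        # posterior odds 4, i.e. p(H2 | y) = 0.8
```

Posterior odds of 4 correspond to a posterior probability of 0.8 that the woman is unaffected.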

(15)

An example where the Bayes factor is bad

• 8 schools example: $y_j \sim N(\theta_j, \sigma_j^2)$, for $j = 1, \dots, J$, with $J = 8$.

H₁: no pooling, $p(\theta_1, \dots, \theta_J) \propto 1$

H₂: complete pooling, $\theta_1 = \dots = \theta_J = \theta$, $p(\theta) \propto 1$

◦ Bayes factor is 0/0

◦ Instead, express flat priors as N(0, A²) and let A get large

◦ Now Bayes factor strongly depends on A

◦ As A → ∞, complete pooling model gets 100% of the probability for any data!

◦ Also a horrible dependence on J
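The dependence on $A$ can be made concrete with a sketch that replaces each flat prior by $N(0, A^2)$ and evaluates both marginal likelihoods in closed form. The slide does not list the data; the estimates and standard errors below are the commonly quoted eight-schools values and are an assumption here, as are the helper names.

```python
import numpy as np

# Commonly quoted eight-schools estimates and standard errors
# (assumption: these values are not given on the slide).
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

def log_mvn_zero_mean(y, cov):
    """Log density of y under a zero-mean multivariate normal with covariance cov."""
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (y.size * np.log(2 * np.pi) + logdet + quad)

def log_bayes_factor(A):
    """log B[H2 : H1] when every flat prior is replaced by N(0, A^2)."""
    J = y.size
    # H1 (no pooling): independent theta_j, so y_j ~ N(0, sigma_j^2 + A^2).
    cov_h1 = np.diag(sigma ** 2 + A ** 2)
    # H2 (complete pooling): one shared theta, so cov = diag(sigma^2) + A^2 * 11'.
    cov_h2 = np.diag(sigma ** 2) + A ** 2 * np.ones((J, J))
    return log_mvn_zero_mean(y, cov_h2) - log_mvn_zero_mean(y, cov_h1)

for A in (10.0, 100.0, 1000.0, 10000.0):
    print(A, log_bayes_factor(A))
```

For large $A$ the log Bayes factor grows like $(J - 1)\log A$, so it can be made arbitrarily large by widening the prior, and the growth rate itself depends on the number of schools $J$.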

(16)

Interpretation of Bayes Factors

• Jeffreys (1961) and Kass & Raftery (1995)

2 log B[H2 : H1]    B[H2 : H1]    Evidence in favor of H2 over H1
0 to 2              1 to 3        Not worth more than a bare mention
2 to 6              3 to 20       Positive
6 to 10             20 to 150     Strong
> 10                > 150         Very strong

• B[H₂: H₁] = 1/B[H₁: H₂]

• Interpretation is on same scale as deviance and likelihood ratio statistics
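The interpretation scale can be encoded as a small lookup. This sketch uses the Kass and Raftery (1995) cut points on the Bayes factor scale (3, 20, 150); the function name `evidence_for_h2` and the treatment of values below 1 are assumptions.

```python
import math

def evidence_for_h2(bayes_factor):
    """Label the evidence for H2 over H1 given B[H2 : H1]."""
    if bayes_factor <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bayes_factor < 1:
        # The evidence points the other way; classify 1/B = B[H1 : H2] instead.
        return "Favors H1; classify 1/B instead"
    if bayes_factor <= 3:
        return "Not worth more than a bare mention"
    if bayes_factor <= 20:
        return "Positive"
    if bayes_factor <= 150:
        return "Strong"
    return "Very strong"

for b in (2.0, 4.0, 50.0, 200.0):
    # 2 log B puts the Bayes factor on the deviance scale.
    print(b, round(2 * math.log(b), 2), evidence_for_h2(b))
```

For instance, a Bayes factor of 4 falls in the "Positive" band.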