(1)

Introduction to Bayesian Statistics

Lecture 11: Model Comparison

Rung-Ching Tsai

Department of Mathematics, National Taiwan Normal University

May 20, 2015

(2)

Evaluating and Comparing Models

Measure of predictive accuracy

log predictive density as a measure of fit

Out-of-sample predictive accuracy as a gold standard

deviance, information criteria and cross-validation

Within-sample predictive accuracy

Subtracting an adjustment

Cross-validation

Model comparison based on predictive performance

Model comparison based on Bayes factor

(3)

Akaike information criterion (AIC)

$\widehat{\mathrm{elpd}}_{\mathrm{AIC}} = \log p(y \mid \hat\theta_{\mathrm{mle}}) - k$

elpd = expected log predictive density; $k$ = number of estimated parameters

Based on fit to observed data given maximum likelihood estimates

Goal: estimate the expected log predictive density, $\mathrm{elpd} = E_{\tilde y}[\log p(\tilde y \mid \hat\theta_{\mathrm{mle}})]$

The expectation averages over the predictive distribution of $\tilde y$

AIC began life with Akaike's (1973) theorem, which established that AIC is an asymptotically unbiased estimator of predictive accuracy.
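
As a minimal sketch of the formula above, the snippet below computes $\widehat{\mathrm{elpd}}_{\mathrm{AIC}}$ for a normal model with unknown mean and variance, so $k = 2$. The data values and the helper name `normal_loglik` are made up for illustration; the last line uses the usual convention of reporting AIC on the deviance scale, $\mathrm{AIC} = -2\,\widehat{\mathrm{elpd}}_{\mathrm{AIC}}$.

```python
import math

def normal_loglik(y, mu, sigma2):
    """log p(y | mu, sigma2) for independent y_i ~ N(mu, sigma2)."""
    n = len(y)
    rss = sum((yi - mu) ** 2 for yi in y)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - 0.5 * rss / sigma2

# Illustrative data (an assumption, not from the lecture).
y = [2.1, 1.4, 3.3, 2.8, 1.9, 2.5]
n = len(y)

# Maximum likelihood estimates of the mean and the variance.
mu_mle = sum(y) / n
sigma2_mle = sum((yi - mu_mle) ** 2 for yi in y) / n

k = 2  # number of estimated parameters (mean and variance)
loglik_mle = normal_loglik(y, mu_mle, sigma2_mle)

elpd_aic = loglik_mle - k   # log p(y | theta_mle) - k
aic = -2 * elpd_aic         # same quantity on the deviance scale

print(f"log-likelihood at MLE: {loglik_mle:.3f}")
print(f"elpd_AIC: {elpd_aic:.3f}, AIC: {aic:.3f}")
```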

(4)

deviance

What is the 'deviance'?

For a likelihood $p(y \mid \theta)$, we define the deviance as $D(y, \theta) = -2 \log p(y \mid \theta)$

e.g. for independent $Y_i \sim \mathrm{Binomial}(n_i, \theta_i)$, $i = 1, \ldots, n$, the deviance is

$-2 \sum_i \left[ y_i \log\theta_i + (n_i - y_i)\log(1 - \theta_i) + \log\binom{n_i}{y_i} \right]$

It is possible to have a negative deviance. Deviance is derived from the likelihood and evaluated at a certain point in parameter space.

Likelihood values greater than 1, which can occur with continuous densities, lead to a negative deviance and are perfectly legitimate.
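
The following sketch evaluates the binomial deviance formula above and then shows a negative deviance for a continuous model. The data, the parameter values and the function names are illustrative assumptions; the normal example uses a small standard deviation so that the density exceeds 1.

```python
import math

def binomial_deviance(y, n, theta):
    """D(y, theta) = -2 log p(y | theta) for independent binomial counts."""
    total = 0.0
    for yi, ni, ti in zip(y, n, theta):
        total += (yi * math.log(ti)
                  + (ni - yi) * math.log(1 - ti)
                  + math.log(math.comb(ni, yi)))
    return -2 * total

def normal_deviance(y, mu, sigma):
    """D(y, mu) = -2 log p(y | mu) for independent y_i ~ N(mu, sigma^2)."""
    return sum(math.log(2 * math.pi * sigma ** 2) + ((yi - mu) / sigma) ** 2
               for yi in y)

# Discrete data: every probability is at most 1, so the deviance is non-negative.
d_binom = binomial_deviance([3, 5], [10, 10], [0.3, 0.5])

# Continuous data with a tight density: the likelihood exceeds 1,
# so the deviance is negative.
d_norm = normal_deviance([0.01, -0.02, 0.0], mu=0.0, sigma=0.05)

print(f"binomial deviance: {d_binom:.3f}")
print(f"normal deviance:   {d_norm:.3f}")
```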

(5)

mean deviance as measure of fit

Dempster (1974) suggested plotting the posterior distribution of the deviance $D = -2 \log p(y \mid \theta)$

Use of the posterior mean deviance $\bar D = E[D]$ as a measure of fit

Invariant to parameterization of θ

Robust, generally converges well

But more complex models will fit the data better and so will have smaller $\bar D$

Need to have some measure of model complexity to trade off against $\bar D$

(6)

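counting parameters and model complexity: $p_D^{(1)}$

Bayesian measure of model complexity (Spiegelhalter et al., 2002):

$E_{\theta \mid y}[-2 \log p(y \mid \theta)] - (-2 \log p(y \mid \tilde\theta)) = E_{\theta \mid y}[D(y, \theta)] - D(y, \tilde\theta)$,

where $\tilde\theta = E[\theta \mid y]$; that is, the measure is the posterior mean deviance minus the deviance at the posterior mean.

The resulting effective number of parameters of a Bayesian model, estimated from posterior draws $\theta^1, \ldots, \theta^L$ with the point estimate $\hat\theta$ taken to be the posterior mean, is

$p_D^{(1)} = \hat E_{\theta \mid y}[D(y, \theta)] - D(y, \tilde\theta) = \hat D_{\mathrm{avg}}(y) - D_{\hat\theta}(y) = \frac{1}{L} \sum_{l=1}^{L} \left( D(y, \theta^l) - D_{\hat\theta}(y) \right)$.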

(7)

counting parameters and model complexity: $p_D^{(2)}$

A related way to measure model complexity is as half the posterior variance of the model-level deviance; its estimate is known as $p_D^{(2)}$ (Gelman et al., 2004):

$p_D^{(2)} = \widehat{\mathrm{var}}_{\theta \mid y}[D(y, \theta)] / 2 = \frac{1}{2} \cdot \frac{1}{L - 1} \sum_{l=1}^{L} \left( D(y, \theta^l) - \hat D_{\mathrm{avg}}(y) \right)^2$
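
Both complexity measures can be estimated from the same set of posterior draws. The sketch below assumes a toy model that is not part of the lecture: $y_i \sim N(\theta, 1)$ with a flat prior on $\theta$, so that $\theta \mid y \sim N(\bar y, 1/n)$ and draws can be simulated directly. The data are simulated and all names are illustrative. With one free parameter, both estimates should come out near 1.

```python
import math
import random
import statistics

def deviance(y, theta, sigma=1.0):
    """D(y, theta) = -2 log p(y | theta) for y_i ~ N(theta, sigma^2)."""
    n = len(y)
    rss = sum((yi - theta) ** 2 for yi in y)
    return n * math.log(2 * math.pi * sigma ** 2) + rss / sigma ** 2

random.seed(0)
y = [random.gauss(0.5, 1.0) for _ in range(20)]  # simulated data (assumption)
n = len(y)
ybar = statistics.fmean(y)

# Flat prior and known sigma = 1 give theta | y ~ N(ybar, 1/n).
L = 20000
draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(L)]

dev = [deviance(y, t) for t in draws]
d_avg = statistics.fmean(dev)                 # posterior mean deviance
d_hat = deviance(y, statistics.fmean(draws))  # deviance at the posterior mean

p_d1 = d_avg - d_hat                 # mean deviance minus deviance at the mean
p_d2 = statistics.variance(dev) / 2  # half the sample variance, 1/(L-1) form

print(f"p_D(1) = {p_d1:.3f}, p_D(2) = {p_d2:.3f}")
```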

(8)

comparison of $p_D^{(1)}$ and $p_D^{(2)}$

$p_D^{(1)}$ is not invariant to reparameterization (the subject of much criticism).

In normal linear hierarchical models, $p_D^{(1)} = \mathrm{tr}(H)$, where $Hy = \hat y$; that is, $H$ is the hat matrix, which projects the data onto the fitted values. Thus $p_D^{(1)} = \sum_i h_{ii}$, the sum of the leverages. In general, the justification depends on asymptotic normality of the posterior distribution.

$p_D^{(1)}$ or $p_D^{(2)}$ can be thought of as the number of 'unconstrained' parameters in the model, where a parameter counts as: 1 if it is estimated with no constraints or prior information; 0 if it is fully constrained or if all the information about the parameter comes from the prior distribution; or an intermediate value if both the data and the prior are informative.

$p_D^{(1)}$ and $p_D^{(2)}$ should be positive. A negative $p_D^{(1)}$ value indicates one or more problems: a non-concave log-likelihood, a conflict between the prior and the data, or a posterior mean that is a poor estimator (such as with a bimodal posterior).
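
To make the leverage interpretation concrete, here is a small sketch for the simplest special case: an ordinary (non-hierarchical) straight-line regression with intercept and slope, where the hat matrix is the least-squares one and its diagonal has the closed form $h_{ii} = 1/n + (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2$. The covariate values are made up. The leverages sum to 2, the number of unconstrained coefficients.

```python
def leverages(x):
    """Diagonal of the hat matrix for a least-squares fit of y on (1, x)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.5, 4.0, 7.0]  # illustrative covariate values (assumption)
h = leverages(x)
trace_h = sum(h)               # tr(H) = sum of the leverages

print([round(hi, 3) for hi in h])
print(f"tr(H) = {trace_h:.3f}")
```

In a hierarchical model the prior shrinks the fitted values, so each leverage, and hence the trace, would be smaller than in this unconstrained case.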

(9)

Deviance information criterion (DIC)

Use a criterion based on the trade-off between the fit of the model to the data and the corresponding complexity of the model

Spiegelhalter et al. (2002) proposed a Bayesian model comparison criterion based on this principle:

Deviance Information Criterion, DIC = goodness of fit + complexity

$\widehat{\mathrm{elpd}}_{\mathrm{DIC}} = \log p(y \mid \hat\theta_{\mathrm{Bayes}}) - p_{\mathrm{DIC}}$

Based on fit to observed data given the posterior mean, $\hat\theta_{\mathrm{Bayes}} = E[\theta \mid y]$

Effective number of parameters $p_{\mathrm{DIC}}$ computed based on a normal approximation ($\chi^2$ approximation to $-2 \log$ likelihood): $p_D^{(1)}$ or $p_D^{(2)}$

Either $p_D^{(1)}$ or $p_D^{(2)}$ is asymptotically OK in expectation

(10)

Model comparison using DIC

The DIC is then defined analogously to AIC as

$\mathrm{DIC} = D(y, \hat\theta_{\mathrm{Bayes}}) + 2 p_D^{(1)} = \bar D + p_D^{(1)}$, or $\mathrm{DIC} = \bar D + p_D^{(2)}$

DIC may be compared across different models and even different methods, as long as the dependent variable does not change between models, which makes DIC a very flexible model-fit statistic.

Like AIC and BIC, DIC is an asymptotic approximation as the sample size becomes large. DIC is valid only when the joint posterior distribution is approximately multivariate normal.

Models with smaller DIC should be preferred. Since DIC increases with model complexity ($p_D^{(1)}$ or $p_D^{(2)}$), the simpler of two models that fit equally well is preferred.
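
A hedged sketch of a DIC comparison follows. It assumes a toy setting that is not from the lecture: data $y_i \sim N(\theta, 1)$, with one model leaving $\theta$ free under a flat prior and a second model fixing $\theta = 0$. The fixed model has no unknown parameters, so its $p_D$ is 0 and its DIC is simply $D(y, 0)$. The function name `dic_from_draws` and the simulated data are illustrative.

```python
import math
import random
import statistics

def deviance(y, theta, sigma=1.0):
    """D(y, theta) = -2 log p(y | theta) for y_i ~ N(theta, sigma^2)."""
    n = len(y)
    rss = sum((yi - theta) ** 2 for yi in y)
    return n * math.log(2 * math.pi * sigma ** 2) + rss / sigma ** 2

def dic_from_draws(y, draws):
    """Return (DIC, p_D) using p_D = mean deviance - deviance at posterior mean."""
    d_bar = statistics.fmean([deviance(y, t) for t in draws])
    d_hat = deviance(y, statistics.fmean(draws))
    p_d = d_bar - d_hat
    return d_hat + 2 * p_d, p_d   # equivalently d_bar + p_d

random.seed(1)
y = [random.gauss(2.0, 1.0) for _ in range(30)]  # simulated data (assumption)
n = len(y)
ybar = statistics.fmean(y)

# Model A: theta free, flat prior, so theta | y ~ N(ybar, 1/n).
draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(5000)]
dic_free, p_d_free = dic_from_draws(y, draws)

# Model B: theta fixed at 0, no parameters to estimate (p_D = 0).
dic_fixed = deviance(y, 0.0)

print(f"free mean:  DIC = {dic_free:.1f} (p_D = {p_d_free:.2f})")
print(f"fixed mean: DIC = {dic_fixed:.1f} (p_D = 0)")
```

Because the simulated data have a mean far from 0, the extra parameter is worth its complexity penalty here and the free-mean model has the smaller DIC.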

(11)

How do I compare different DICs?

The model with the minimum DIC is estimated to be the one that would make the best short-term predictions, in the same spirit as Akaike's criterion.

It is difficult to say what would constitute an important difference in DIC. Very roughly:

differences of more than 10 might definitely rule out the model with the higher DIC;

differences between 5 and 10 are substantial;

if the difference in DIC is, say, less than 5, and the models make very different inferences, then it could be misleading just to report the model with the lowest DIC.

(12)

Watanabe-Akaike information criterion (WAIC)

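$\widehat{\mathrm{elppd}}_{\mathrm{WAIC}} = \left( \sum_{i=1}^{n} \log p_{\mathrm{post}}(y_i) \right) - p_{\mathrm{WAIC}}$

elppd = expected log pointwise predictive density

Based on posterior predictive fit to observed data

$p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{var}_{\mathrm{post}}(\log p(y_i \mid \theta))$

Compute $p_{\mathrm{post}}$ and $\mathrm{var}_{\mathrm{post}}$ using posterior simulations

Requires a partition of the data into $n$ pieces $y_1, \ldots, y_n$

Connection to leave-one-out cross-validation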
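
The two WAIC ingredients can be computed pointwise from posterior simulations, as in the minimal sketch below. It assumes the same kind of toy model as an illustration only: $y_i \sim N(\theta, 1)$ with a flat prior, so $\theta \mid y \sim N(\bar y, 1/n)$. Here $p_{\mathrm{post}}(y_i)$ is approximated by averaging $p(y_i \mid \theta^l)$ over the draws, and $\mathrm{var}_{\mathrm{post}}$ by the sample variance of $\log p(y_i \mid \theta^l)$. The data are simulated and the names are illustrative.

```python
import math
import random
import statistics

def log_normal_pdf(y, mu, sigma=1.0):
    """log N(y | mu, sigma^2)."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - 0.5 * ((y - mu) / sigma) ** 2)

random.seed(2)
y = [random.gauss(0.0, 1.0) for _ in range(20)]  # simulated data (assumption)
n = len(y)
ybar = statistics.fmean(y)

# Posterior draws under a flat prior with sigma = 1 known.
draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(2000)]

lppd = 0.0    # sum_i log p_post(y_i)
p_waic = 0.0  # sum_i var_post(log p(y_i | theta))
for yi in y:
    logp = [log_normal_pdf(yi, t) for t in draws]
    lppd += math.log(statistics.fmean([math.exp(v) for v in logp]))
    p_waic += statistics.variance(logp)

elppd_waic = lppd - p_waic

print(f"lppd = {lppd:.2f}, p_WAIC = {p_waic:.2f}, elppd_WAIC = {elppd_waic:.2f}")
```

For log densities of large magnitude, averaging `exp(v)` directly can underflow; a log-sum-exp formulation would be the safer choice in real use.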

(13)

Model comparison: Bayes factor

Comparing two or more models:

$\frac{p(H_2 \mid y)}{p(H_1 \mid y)} = \frac{p(H_2)}{p(H_1)} \times \frac{p(y \mid H_2)}{p(y \mid H_1)}$

$\frac{p(H_2)}{p(H_1)}$ is the "prior odds"

$B[H_2 : H_1] = \frac{p(y \mid H_2)}{p(y \mid H_1)}$ is the "Bayes factor", with $p(y \mid H) = \int p(y \mid \theta, H)\, p(\theta \mid H)\, d\theta$.

Problem with $p(y \mid H)$

Integral depends on irrelevant tail properties of the prior density

Consider $\bar y \sim N(\theta, \sigma^2 / n)$ and $\theta \sim U(-A, A)$, for some large $A$

Marginal $p(y)$ is proportional to $1/A$
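
The $1/A$ behaviour can be checked numerically. Under $\theta \sim U(-A, A)$ the marginal density is $p(\bar y) = \frac{1}{2A} \left[ \Phi\left(\frac{A - \bar y}{\sigma/\sqrt{n}}\right) - \Phi\left(\frac{-A - \bar y}{\sigma/\sqrt{n}}\right) \right]$, and the bracket is essentially 1 once $A$ is large. The values of $\bar y$, $\sigma$ and $n$ below are arbitrary illustrations.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def marginal_ybar(ybar, A, sigma=1.0, n=10):
    """p(ybar) when ybar ~ N(theta, sigma^2/n) and theta ~ U(-A, A)."""
    s = sigma / math.sqrt(n)
    mass = norm_cdf((A - ybar) / s) - norm_cdf((-A - ybar) / s)
    return mass / (2 * A)

ybar = 1.3  # illustrative observed mean (assumption)
for A in (10, 100, 1000):
    m = marginal_ybar(ybar, A)
    # A * p(ybar) stays constant, so p(ybar) itself shrinks like 1/A.
    print(f"A = {A:5d}: p(ybar) = {m:.6f}, A * p(ybar) = {A * m:.4f}")
```

The data have not changed, yet the marginal likelihood falls by a factor of 10 each time $A$ grows tenfold, which is the dependence on irrelevant tail properties of the prior noted above.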

(14)

An example where the Bayes factor is good

Genetics example with

$H_1$: the woman is affected, $\theta = 1$

$H_2$: the woman is unaffected, $\theta = 0$

prior odds are p(H2)/p(H1) = 1

Bayes factor of the data is p(y |H2)/p(y |H1) = 1.0/0.25 = 4

the posterior odds are thus p(H2|y )/p(H1|y ) = 4

Two features allow Bayes factors to be helpful here:

Each of the discrete alternatives makes scientific sense, and there are no obvious scientific models in between; i.e., a truly discrete parameter space

Model of probabilities; no unbounded parameters
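
The arithmetic of this example is short enough to write out. The sketch uses only the numbers quoted above (prior odds 1, likelihoods 1.0 and 0.25); the conversion from odds to a probability on the last line is the standard identity, added for illustration.

```python
prior_odds = 1.0  # p(H2) / p(H1)
lik_h1 = 0.25     # p(y | H1), woman affected
lik_h2 = 1.0      # p(y | H2), woman unaffected

bayes_factor = lik_h2 / lik_h1              # B[H2 : H1]
posterior_odds = prior_odds * bayes_factor  # p(H2 | y) / p(H1 | y)
prob_h2 = posterior_odds / (1 + posterior_odds)

print(f"Bayes factor B[H2:H1] = {bayes_factor}")  # 4.0
print(f"posterior odds = {posterior_odds}")       # 4.0
print(f"p(H2 | y) = {prob_h2:.2f}")               # 0.80
```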

(15)

An example where the Bayes factor is bad

8 schools example: $y_j \sim N(\theta_j, \sigma_j^2)$, for $j = 1, \ldots, J$, with $J = 8$.

$H_1$: no pooling, $p(\theta_1, \ldots, \theta_J) \propto 1$

$H_2$: complete pooling, $\theta_1 = \cdots = \theta_J = \theta$, $p(\theta) \propto 1$

Bayes factor is 0/0

Instead, express flat priors as $N(0, A^2)$ and let $A$ get large

Now Bayes factor strongly depends on A

As A → ∞, complete pooling model gets 100% of the probability for any data!

Also a horrible dependence on J
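
The dependence on $A$ can be made explicit, since both marginal likelihoods have closed forms under $N(0, A^2)$ priors. The sketch below is an illustration under stated assumptions: the estimates and standard errors are the standard eight-schools values from Gelman et al., which do not appear on the slide, and the function names are hypothetical. Under no pooling each $y_j$ is marginally $N(0, \sigma_j^2 + A^2)$. Under complete pooling, integrating $\theta$ out analytically gives the expression in `log_marginal_complete_pooling`.

```python
import math

# Eight-schools estimates and standard errors (assumed standard values).
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def log_norm(x, mean, var):
    """log N(x | mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_marginal_no_pooling(A):
    """H1: theta_j ~ N(0, A^2) independently, so y_j ~ N(0, sigma_j^2 + A^2)."""
    return sum(log_norm(yj, 0.0, sj ** 2 + A ** 2) for yj, sj in zip(y, sigma))

def log_marginal_complete_pooling(A):
    """H2: a single theta ~ N(0, A^2), integrated out analytically."""
    w = [1 / sj ** 2 for sj in sigma]                # precisions
    W = sum(w)
    m = sum(wj * yj for wj, yj in zip(w, y)) / W     # precision-weighted mean
    log_lik_at_m = sum(log_norm(yj, m, sj ** 2) for yj, sj in zip(y, sigma))
    return (log_lik_at_m
            + 0.5 * math.log(2 * math.pi / W)
            + log_norm(m, 0.0, A ** 2 + 1 / W))

def log_bf_pool_vs_nopool(A):
    """log B[H2 : H1], complete pooling versus no pooling."""
    return log_marginal_complete_pooling(A) - log_marginal_no_pooling(A)

for A in (10, 100, 1000, 10000):
    print(f"A = {A:6d}: log B[H2:H1] = {log_bf_pool_vs_nopool(A):8.2f}")
```

In this sketch the no-pooling marginal shrinks like $A^{-J}$ and the complete-pooling marginal like $A^{-1}$, so for large $A$ the Bayes factor grows roughly like $A^{J-1} = A^7$, whatever the data. That is the sense in which it depends on both $A$ and $J$.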

(16)

Interpretation of Bayes Factors

Jeffreys (1961) and Kass & Raftery (1995)

| $2 \log B[H_2 : H_1]$ | $B[H_2 : H_1]$ | Evidence in favor of $H_2$ over $H_1$ |
|---|---|---|
| 0 to 2 | 1 to 3 | Not worth more than a bare mention |
| 2 to 6 | 3 to 20 | Positive |
| 6 to 10 | 20 to 150 | Strong |
| > 10 | > 150 | Very strong |

$B[H_2 : H_1] = 1 / B[H_1 : H_2]$

Interpretation is on the same scale as deviance and likelihood ratio statistics (logarithms are natural logarithms)
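
As a small illustration, the table can be turned into a lookup on $2 \log B$. The function name and the treatment of boundary values (each lower bound inclusive) are assumptions made for this sketch; the table itself does not say which category the endpoints belong to.

```python
import math

def evidence_category(bf):
    """Evidence for H2 over H1 on the 2 log B scale (natural log)."""
    two_log_b = 2 * math.log(bf)
    if two_log_b < 0:
        # B < 1 favors H1; interpret 1/B = B[H1 : H2] instead.
        return "Favors H1"
    if two_log_b < 2:
        return "Not worth more than a bare mention"
    if two_log_b < 6:
        return "Positive"
    if two_log_b < 10:
        return "Strong"
    return "Very strong"

for bf in (2, 4, 50, 200):
    print(f"B = {bf:4d}, 2 log B = {2 * math.log(bf):5.2f}: {evidence_category(bf)}")
```

For instance, the Bayes factor of 4 from the genetics example gives $2 \log 4 \approx 2.77$, which falls in the "Positive" band.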
